
Shap summary plots for XGBoost with categorical data inputs


XGBoost supports inputting features as categories directly, which is very useful when there are many categorical variables. However, this doesn't seem to be compatible with Shap:

import pandas as pd
import xgboost
import shap

# Test data
test_data = pd.DataFrame({'target':[23,42,58,29,28],
                      'feature_1' : [38, 83, 38, 28, 57],
                      'feature_2' : ['A', 'B', 'A', 'C','A']})
test_data['feature_2'] = test_data['feature_2'].astype('category')

# Fit xgboost
model = xgboost.XGBRegressor(enable_categorical=True,
                                       tree_method='hist')
model.fit(test_data.drop('target', axis=1), test_data['target'] )

# Explain with Shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(test_data)

Throws an error: ValueError: DataFrame.dtypes for data must be int, float, bool or category.

Is it possible to use Shap in this situation?


Solution

  • Unfortunately, generating shap values with xgboost using categorical variables is an open issue. See, e.g., https://github.com/slundberg/shap/issues/2662

    Given your specific example, I made it run by passing a DMatrix to shap (DMatrix is the basic input data type of xgboost models; see the Learning API. The sklearn API, which you are using, doesn't need a DMatrix, at least for training):

    import pandas as pd
    import xgboost as xgb
    import shap
    
    # Test data
    test_data = pd.DataFrame({'target':[23,42,58,29,28],
                          'feature_1' : [38, 83, 38, 28, 57],
                          'feature_2' : ['A', 'B', 'A', 'C','A']})
    test_data['feature_2'] = test_data['feature_2'].astype('category')
    print(test_data.info())
    # Fit xgboost
    model = xgb.XGBRegressor(enable_categorical=True,
                                           tree_method='hist')
    model.fit(test_data.drop('target', axis=1), test_data['target'] )
    
    # Explain with Shap
    test_data_dm = xgb.DMatrix(data=test_data.drop('target', axis=1), label=test_data['target'], enable_categorical=True)
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(test_data_dm)
    print(shap_values)
    

    But the ability to generate shap values when there are categorical variables is very unstable: for example, if you add other parameters to the xgboost model, you get the error "Check failed: !HasCategoricalSplit()", which is the error referenced in my first link:

    import pandas as pd
    import xgboost as xgb
    import shap
    
    # Test data
    test_data = pd.DataFrame({'target':[23,42,58,29,28],
                          'feature_1' : [38, 83, 38, 28, 57],
                          'feature_2' : ['A', 'B', 'A', 'C','A']})
    test_data['feature_2'] = test_data['feature_2'].astype('category')
    print(test_data.info())
    # Fit xgboost
    model = xgb.XGBRegressor(colsample_bylevel= 0.7, 
                                 enable_categorical=True,
                                 tree_method='hist')
    model.fit(test_data.drop('target', axis=1), test_data['target'] )
    
    # Explain with Shap
    test_data_dm = xgb.DMatrix(data=test_data.drop('target', axis=1), label=test_data['target'], enable_categorical=True)
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(test_data_dm)
    print(shap_values)
    

    I've searched for a solution for months but, to conclude, as far as I understand, it is not really possible yet to generate shap values with xgboost and categorical variables (I hope someone can contradict me with a reproducible example). I suggest you try CatBoost.

    ########################## EDIT ############################

    An example with CatBoost:

    import pandas as pd
    import catboost as cb
    import shap
    
    # Test data
    test_data = pd.DataFrame({'target':[23,42,58,29,28],
                          'feature_1' : [38, 83, 38, 28, 57],
                          'feature_2' : ['A', 'B', 'A', 'C','A']})
    test_data['feature_2'] = test_data['feature_2'].astype('category')
    print(test_data.info())
    
    model = cb.CatBoostRegressor(iterations=100)
    model.fit(test_data.drop('target', axis=1), test_data['target'],
                        cat_features=['feature_2'], verbose=False)
    
    # Explain with Shap
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(test_data.drop('target', axis=1))
    print('shap values: \n', shap_values)