
Shap summary plots for XGBoost with categorical data inputs


XGBoost supports inputting features as categories directly, which is very useful when there are many categorical variables. However, this doesn't seem to be compatible with Shap:

import pandas as pd
import xgboost
import shap

# Test data
test_data = pd.DataFrame({'target':[23,42,58,29,28],
                      'feature_1' : [38, 83, 38, 28, 57],
                      'feature_2' : ['A', 'B', 'A', 'C','A']})
test_data['feature_2'] = test_data['feature_2'].astype('category')

# Fit xgboost
model = xgboost.XGBRegressor(enable_categorical=True,
                                       tree_method='hist')
model.fit(test_data.drop('target', axis=1), test_data['target'] )

# Explain with Shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(test_data)

Throws an error: ValueError: DataFrame.dtypes for data must be int, float, bool or category.

Is it possible to use Shap in this situation?


Solution

  • Unfortunately, generating shap values with xgboost using categorical variables is an open issue. See, e.g., https://github.com/slundberg/shap/issues/2662

    Given your specific example, I made it run by passing a DMatrix to shap (DMatrix is the basic input data type of xgboost models; see the Learning API. The sklearn API, which you are using, doesn't need a DMatrix, at least for training):

    import pandas as pd
    import xgboost as xgb
    import shap
    
    # Test data
    test_data = pd.DataFrame({'target':[23,42,58,29,28],
                          'feature_1' : [38, 83, 38, 28, 57],
                          'feature_2' : ['A', 'B', 'A', 'C','A']})
    test_data['feature_2'] = test_data['feature_2'].astype('category')
    print(test_data.info())
    # Fit xgboost
    model = xgb.XGBRegressor(enable_categorical=True,
                                           tree_method='hist')
    model.fit(test_data.drop('target', axis=1), test_data['target'] )
    
    # Explain with Shap
    test_data_dm = xgb.DMatrix(data=test_data.drop('target', axis=1), label=test_data['target'], enable_categorical=True)
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(test_data_dm)
    print(shap_values)
    

    But the ability to generate shap values when there are categorical variables is very unstable: for example, if you add other parameters to the xgboost model, you get the error "Check failed: !HasCategoricalSplit()", which is the error referenced in my first link:

    import pandas as pd
    import xgboost as xgb
    import shap
    
    # Test data
    test_data = pd.DataFrame({'target':[23,42,58,29,28],
                          'feature_1' : [38, 83, 38, 28, 57],
                          'feature_2' : ['A', 'B', 'A', 'C','A']})
    test_data['feature_2'] = test_data['feature_2'].astype('category')
    print(test_data.info())
    # Fit xgboost
    model = xgb.XGBRegressor(colsample_bylevel= 0.7, 
                                 enable_categorical=True,
                                 tree_method='hist')
    model.fit(test_data.drop('target', axis=1), test_data['target'] )
    
    # Explain with Shap
    test_data_dm = xgb.DMatrix(data=test_data.drop('target', axis=1), label=test_data['target'], enable_categorical=True)
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(test_data_dm)
    print(shap_values)
    

    I've searched for a solution for months but, to conclude, as far as I understand, it is not really possible yet to generate shap values with xgboost and categorical variables (I hope someone can contradict me with a reproducible example). I suggest you try CatBoost.

    ########################## EDIT ############################

    An example with CatBoost:

    import pandas as pd
    import catboost as cb
    import shap
    
    # Test data
    test_data = pd.DataFrame({'target':[23,42,58,29,28],
                          'feature_1' : [38, 83, 38, 28, 57],
                          'feature_2' : ['A', 'B', 'A', 'C','A']})
    test_data['feature_2'] = test_data['feature_2'].astype('category')
    print(test_data.info())
    
    model = cb.CatBoostRegressor(iterations=100)
    model.fit(test_data.drop('target', axis=1), test_data['target'],
                        cat_features=['feature_2'], verbose=False)
    
    # Explain with Shap
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(test_data.drop('target', axis=1))
    print('shap values: \n', shap_values)