Tags: python, lightgbm, shap

SHAP not working with LightGBM categorical features


My model uses LGBMClassifier. I'd like to use SHAP (Shapley values) to interpret its features. However, SHAP raises an error on categorical features. For example, I have a feature "Smoker" whose values are "Yes" and "No", and SHAP fails with:

ValueError: could not convert string to float: 'Yes'.

Am I missing any settings?

BTW, I know that I could use one-hot encoding to convert categorical features but I don't want to, since LGBMClassifier can handle categorical features without one-hot encoding.

Here's the sample code: (shap version is 0.40.0, lightgbm version is 3.3.2)

import pandas as pd
from lightgbm import LGBMClassifier #My version is 3.3.2
import shap #My version is 0.40.0

#The training data
X_train = pd.DataFrame()
X_train["Age"] = [50, 20, 60, 30]
X_train["Smoker"] = ["Yes", "No", "No", "Yes"]

#Target: whether the person had a certain disease
y_train = [1, 0, 0, 0]
#I did convert categorical features to the Category data type.
X_train["Smoker"] = X_train["Smoker"].astype("category")

#The test data
X_test = pd.DataFrame()
X_test["Age"] = [50]
X_test["Smoker"] = ["Yes"]
X_test["Smoker"] = X_test["Smoker"].astype("category")

#the classifier    
clf = LGBMClassifier()
clf.fit(X_train, y_train)
predicted = clf.predict(X_test)

#shap
explainer = shap.TreeExplainer(clf)
#I see this setting from google search but it did not really help
explainer.model.original_model.params = {"categorical_feature":["Smoker"]}
shap_values = explainer(X_train) #the error came out here: ValueError: could not convert string to float: 'Yes'

Solution

  • Let's try a slightly different call, exp.shap_values(X_train) instead of calling the explainer object directly:

    import pandas as pd
    from lightgbm import LGBMClassifier
    import shap
    
    X_train = pd.DataFrame({
        "Age": [50, 20, 60, 30],
        "Smoker": ["Yes", "No", "No", "Yes"],
    })
    X_train["Smoker"] = X_train["Smoker"].astype("category")
    y_train = [1, 0, 0, 0]
    
    X_test = pd.DataFrame({"Age": [50], "Smoker": ["Yes"]})
    X_test["Smoker"] = X_test["Smoker"].astype("category")
    
    
    clf = LGBMClassifier(verbose=-1).fit(X_train, y_train)
    predicted = clf.predict(X_test)
    print("Predictions:", predicted)
    
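    exp = shap.TreeExplainer(clf)
    # Call shap_values() rather than exp(X_train)
    sv = exp.shap_values(X_train)
    
    print(f"Expected values: {exp.expected_value}")
    # With shap 0.40.0, sv is a list with one array per class; sv[1] is the positive class
    print(f"SHAP values for 0th data point: {sv[1][0]}")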
    

    Predictions: [0]
    Expected values: [1.0986122886681098, -1.0986122886681098]
    SHAP values for 0th data point: [0. 0.]
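
The printed numbers can be checked by hand. The sketch below assumes LightGBM's defaults (boost_from_average=True and min_child_samples=20), which the code above does not override. With only four rows, no split can satisfy min_child_samples, so the model most likely just predicts the base rate. That would explain why every SHAP value is 0 and why the expected value for class 1 equals the log-odds of the positive rate.

```python
import math

# Same labels as in the example above
y_train = [1, 0, 0, 0]

# Share of positive labels (the base rate)
p = sum(y_train) / len(y_train)

# Log-odds of the base rate: the raw score a split-free model would output
log_odds_pos = math.log(p / (1 - p))
log_odds_neg = -log_odds_pos

print(f"Base rate: {p}")
print(f"Log-odds [class 0, class 1]: [{log_odds_neg}, {log_odds_pos}]")
```

On a realistically sized data set the trees should actually split, and the SHAP values would then be non-zero.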
    

    Note that you don't need to tinker with explainer.model.original_model.params. That attribute exposes the model's internals and isn't meant to be set by hand, and the parameters it holds, including which features are categorical, were already fixed when the model was trained.
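
As for why explainer(X_train) fails while shap_values(X_train) works, a likely explanation is that the call form converts the DataFrame to a plain NumPy array before handing it to the model. This is an assumption about shap's internals around version 0.40, not something confirmed in this thread. The sketch below uses only pandas to show what such a conversion does to a category column: the dtype is lost, the strings come back, and a cast to float produces the same error message as in the question.

```python
import pandas as pd

X = pd.DataFrame({
    "Age": [50, 20, 60, 30],
    "Smoker": ["Yes", "No", "No", "Yes"],
})
X["Smoker"] = X["Smoker"].astype("category")

# As a DataFrame, the column keeps its category dtype and integer codes
smoker_dtype = str(X["Smoker"].dtype)
smoker_codes = X["Smoker"].cat.codes.tolist()
print(smoker_dtype, smoker_codes)

# Converting to a NumPy array drops the category dtype: the result is an
# object array that holds the raw strings again
arr = X.values
print(arr.dtype)

# Casting that array to float fails on the first string it meets
try:
    arr.astype(float)
    error_message = None
except ValueError as err:
    error_message = str(err)
print(error_message)
```

If that assumption holds, anything that keeps the input as a DataFrame with category columns, as shap_values(X_train) does here, avoids the error.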