python · machine-learning · xgboost · shap

Waterfall Plot with TreeExplainer


Using SHAP's TreeExplainer, I cannot plot the waterfall plot.

Error Message:

---> 17 shap.plots.waterfall(shap_values[0], max_display=14) 
TypeError: The waterfall plot requires an `Explanation` object as the
`shap_values` argument.

Since my model is tree-based (I am using xgb.XGBClassifier), I use TreeExplainer.

If I use Explainer instead of TreeExplainer, I can plot the waterfall plot.
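
For example, this variant works (a minimal sketch, reusing the best_model_xgboost and X_train that are defined in the full code below):

import shap

# shap.Explainer auto-selects a tree explainer for XGBoost models, and
# calling the explainer returns an Explanation object, not a raw ndarray
explainer = shap.Explainer(best_model_xgboost)
exp = explainer(X_train)
shap.plots.waterfall(exp[0], max_display=14)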

My code is given below:

import pandas as pd

data = {
    'a': [1, 2, 3, 3, 2, 1, 4, 5, 6, 7, 8, 1, 2, 3, 3, 2, 1, 4, 5, 6, 7, 8],
    'b': [2, 1, 2, 3, 4, 6, 5, 8, 7, 9, 10, 2, 1, 2, 3, 4, 6, 5, 8, 7, 9, 10],
    'c': [1, 5, 2, 4, 3, 9, 6, 8, 7, 10, 1, 1, 5, 2, 4, 3, 9, 6, 8, 7, 10, 1],
    'd': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1],
    'e': [1, 2, 3, 4, 3, 2, 1, 5, 4, 2, 1, 1, 2, 3, 4, 3, 2, 1, 5, 4, 2, 1],
    'f': [1, 1, 2, 1, 2, 2, 3, 3, 3, 2, 1, 1, 1, 2, 1, 2, 2, 3, 3, 3, 2, 1],
    'g': [3, 3, 2, 1, 3, 2, 1, 1, 1, 2, 2, 3, 3, 2, 1, 3, 2, 1, 1, 1, 2, 2],
    'h': [1, 2, 1, 2, 3, 4, 5, 3, 4, 5, 5, 1, 2, 1, 2, 3, 4, 5, 3, 4, 5, 5],
    'i': [1, 2, 1, 2, 3, 4, 5, 6, 5, 4, 6, 1, 2, 1, 2, 3, 4, 5, 6, 5, 4, 6],
    'j': [5, 4, 3, 2, 1, 1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1, 1, 2, 3, 4, 5, 6],
    'k': [3, 3, 2, 1, 4, 3, 2, 2, 2, 1, 1, 3, 3, 2, 1, 4, 3, 2, 2, 2, 1, 1],
    'r': [1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1]
}

df = pd.DataFrame(data)

X = df.iloc[:,[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]]
y = df.iloc[:,11]

from sklearn.model_selection import train_test_split, GridSearchCV
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 42)

import xgboost as xgb
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

param_grid = {
    'max_depth'     :   [6],
    'n_estimators'  :   [500],
    'learning_rate' :   [0.3]
}


grid_search_xgboost =   GridSearchCV(
    estimator       =   xgb.XGBClassifier(),
    param_grid      =   param_grid,
    cv              =   3,  
    verbose         =   2,  
    n_jobs          =   -1  
)

grid_search_xgboost.fit(X_train, y_train)

print("Best Parameters:", grid_search_xgboost.best_params_)
best_model_xgboost = grid_search_xgboost.best_estimator_

import shap

explainer = shap.TreeExplainer(best_model_xgboost)
shap_values = explainer.shap_values(X_train)

shap.summary_plot(shap_values, X_train, plot_type="bar")

shap.summary_plot(shap_values, X_train)

for name in X_train.columns:
    shap.dependence_plot(name, shap_values, X_train)

shap.force_plot(explainer.expected_value, shap_values[0], X_train.iloc[0], matplotlib=True)

shap.decision_plot(explainer.expected_value, shap_values[:10], X_train.iloc[:10])

shap.plots.waterfall(shap_values[0], max_display=14)

Where is the problem?


Solution

  • Instead of feeding the SHAP values in as a numpy.ndarray, pass an Explanation object:

    import pandas as pd
    import xgboost as xgb
    import shap
    from sklearn.model_selection import train_test_split, GridSearchCV
    
    data = {
        'a': [1, 2, 3, 3, 2, 1, 4, 5, 6, 7, 8, 1, 2, 3, 3, 2, 1, 4, 5, 6, 7, 8],
        'b': [2, 1, 2, 3, 4, 6, 5, 8, 7, 9, 10, 2, 1, 2, 3, 4, 6, 5, 8, 7, 9, 10],
        'c': [1, 5, 2, 4, 3, 9, 6, 8, 7, 10, 1, 1, 5, 2, 4, 3, 9, 6, 8, 7, 10, 1],
        'd': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1],
        'e': [1, 2, 3, 4, 3, 2, 1, 5, 4, 2, 1, 1, 2, 3, 4, 3, 2, 1, 5, 4, 2, 1],
        'f': [1, 1, 2, 1, 2, 2, 3, 3, 3, 2, 1, 1, 1, 2, 1, 2, 2, 3, 3, 3, 2, 1],
        'g': [3, 3, 2, 1, 3, 2, 1, 1, 1, 2, 2, 3, 3, 2, 1, 3, 2, 1, 1, 1, 2, 2],
        'h': [1, 2, 1, 2, 3, 4, 5, 3, 4, 5, 5, 1, 2, 1, 2, 3, 4, 5, 3, 4, 5, 5],
        'i': [1, 2, 1, 2, 3, 4, 5, 6, 5, 4, 6, 1, 2, 1, 2, 3, 4, 5, 6, 5, 4, 6],
        'j': [5, 4, 3, 2, 1, 1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1, 1, 2, 3, 4, 5, 6],
        'k': [3, 3, 2, 1, 4, 3, 2, 2, 2, 1, 1, 3, 3, 2, 1, 4, 3, 2, 2, 2, 1, 1],
        'r': [1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1]
    }
    
    df = pd.DataFrame(data)
    
    X = df.iloc[:,[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]]
    y = df.iloc[:,11]
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 42)
    
    param_grid = {
        'max_depth'     :   [6],
        'n_estimators'  :   [500],
        'learning_rate' :   [0.3]
    }
       
    grid_search_xgboost =   GridSearchCV(
        estimator       =   xgb.XGBClassifier(),
        param_grid      =   param_grid,
        cv              =   3,  
        verbose         =   2,  
        n_jobs          =   -1  
    )
    
    grid_search_xgboost.fit(X_train, y_train)
    
    print("Best Parameters:", grid_search_xgboost.best_params_)
    best_model_xgboost = grid_search_xgboost.best_estimator_
    
    explainer = shap.TreeExplainer(best_model_xgboost)
    exp = explainer(X_train) # <-- call the explainer itself; this returns an Explanation
    print(type(exp))
    shap.plots.waterfall(exp[0])
    

    <class 'shap._explanation.Explanation'>
    

    [image: waterfall plot produced by the call above]
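
    With the Explanation object you can also switch your other plots to the new-style API (a sketch, assuming a recent SHAP version; these mirror the calls from your code):

        shap.plots.bar(exp)                        # replaces summary_plot(..., plot_type="bar")
        shap.plots.beeswarm(exp)                   # replaces summary_plot(...)
        shap.plots.scatter(exp[:, "a"])            # replaces dependence_plot("a", ...)
        shap.plots.force(exp[0], matplotlib=True)  # replaces force_plot(...)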

    Why?

    Because SHAP has two plotting interfaces: an old one and a new one. The old one (the plots your code calls with raw arrays) expects the SHAP values as a NumPy ndarray; the new one expects an Explanation object, which is exactly what the error message says. explainer.shap_values(X_train) returns a raw ndarray, while calling the explainer itself, explainer(X_train), returns an Explanation.
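
    If you already have raw SHAP values as an ndarray, you can also wrap them in an Explanation yourself instead of recomputing them (a sketch; the keyword arguments are the standard Explanation constructor fields):

        raw = explainer.shap_values(X_train)           # old interface: plain ndarray
        exp_manual = shap.Explanation(
            values=raw[0],                             # SHAP values for the first row
            base_values=explainer.expected_value,      # the model's expected output
            data=X_train.iloc[0].values,               # feature values for that row
            feature_names=list(X_train.columns),
        )
        shap.plots.waterfall(exp_manual, max_display=14)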