python pandas machine-learning random-forest shap

Get waterfall plot values of a feature in a dataframe using shap package


I am working on a binary classification problem using a random forest model and neural networks, and I am using SHAP to explain the model predictions. I followed the tutorial and wrote the code below to get the waterfall plot shown below.

With the help of Sergey Bushmanov's SO post here, I managed to export the waterfall plot to a dataframe. But this doesn't copy the feature values of the columns; it only copies the SHAP values, expected_value and feature names. I want the feature values as well, so I tried the following:

shap.waterfall_plot(shap.Explanation(values=shap_values[1])[4],base_values=explainer.expected_value[1],data=ord_test_t.iloc[4],feature_names=ord_test_t.columns.tolist())

but this threw an error

TypeError: waterfall() got an unexpected keyword argument 'base_values'

I expect my output to look like the one below. I used a background of 1 point to compute the base value, but you are free to use a background of 1, 10 or 100 points as well. In the output below, I have stored the value and the feature name in one column called Feature, similar to LIME. I am not sure whether SHAP has the flexibility to do this.

[image: expected output]

update - plot

[image: waterfall plot]

update code - kernel explainer waterfall to dataframe

masker = Independent(X_train, max_samples=100)
explainer = KernelExplainer(rf_boruta.predict,X_train)
bv = explainer.expected_value
sv = explainer.shap_values(X_train)

sdf_train = pd.DataFrame({
    'row_id': X_train.index.values.repeat(X_train.shape[1]),
    'feature': X_train.columns.to_list() * X_train.shape[0],
    'feature_value': X_train.values.flatten(),
    'base_value': bv,
    'shap_values': sv.values[:,:,1].flatten()   # i changed this to pd.DataFrame(sv).values[:,1].flatten()
})

Solution

  • Try the following:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.datasets import load_breast_cancer
    from shap import TreeExplainer, Explanation
    from shap.plots import waterfall
    
    import shap
    print(shap.__version__)
    
    X, y = load_breast_cancer(return_X_y=True, as_frame=True)
    model = RandomForestClassifier(max_depth=5, n_estimators=100).fit(X, y)
    explainer = TreeExplainer(model)
    sv = explainer(X)
    exp = Explanation(sv.values[:,:,1], 
                      sv.base_values[:,1], 
                      data=X.values, 
                      feature_names=X.columns)
    idx = 0
    waterfall(exp[idx])
    

    0.39.0
    

    [image: waterfall plot for idx = 0]

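    Then:

    import pandas as pd

    pd.DataFrame({
        'row_id': idx,
        'feature': X.columns,
        'feature_value': exp[idx].data,    # raw feature values of the explained row
        'base_value': exp[idx].base_values,
        'shap_values': exp[idx].values     # per-feature SHAP contributions
    })
    

    # output (the feature_value column shown was captured with exp[idx].values, so it repeats the SHAP values; with exp[idx].data it holds the raw feature values)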
        row_id                  feature  feature_value  base_value  shap_values
     0       0              mean radius      -0.035453    0.628998    -0.035453
     1       0             mean texture       0.047571    0.628998     0.047571
     2       0           mean perimeter      -0.036218    0.628998    -0.036218
     3       0                mean area      -0.041276    0.628998    -0.041276
     4       0          mean smoothness      -0.006842    0.628998    -0.006842
     5       0         mean compactness      -0.009275    0.628998    -0.009275
     6       0           mean concavity      -0.035188    0.628998    -0.035188
     7       0      mean concave points      -0.051165    0.628998    -0.051165
     8       0            mean symmetry      -0.002192    0.628998    -0.002192
     9       0   mean fractal dimension       0.001521    0.628998     0.001521
    10       0             radius error      -0.021223    0.628998    -0.021223
    11       0            texture error      -0.000470    0.628998    -0.000470
    12       0          perimeter error      -0.021423    0.628998    -0.021423
    13       0               area error      -0.035313    0.628998    -0.035313
    14       0         smoothness error      -0.000060    0.628998    -0.000060
    15       0        compactness error       0.001053    0.628998     0.001053
    16       0          concavity error      -0.002988    0.628998    -0.002988
    17       0     concave points error       0.000140    0.628998     0.000140
    18       0           symmetry error       0.001238    0.628998     0.001238
    19       0  fractal dimension error      -0.001097    0.628998    -0.001097
    20       0             worst radius      -0.050027    0.628998    -0.050027
    21       0            worst texture       0.038056    0.628998     0.038056
    22       0          worst perimeter      -0.079717    0.628998    -0.079717
    23       0               worst area      -0.072312    0.628998    -0.072312
    24       0         worst smoothness      -0.006917    0.628998    -0.006917
    25       0        worst compactness      -0.016184    0.628998    -0.016184
    26       0          worst concavity      -0.022500    0.628998    -0.022500
    27       0     worst concave points      -0.088697    0.628998    -0.088697
    28       0           worst symmetry      -0.026166    0.628998    -0.026166
    29       0  worst fractal dimension      -0.007683    0.628998    -0.007683
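
    The question also asks for the feature name and its value in a single LIME-style column called Feature. A minimal sketch of one way to do that with plain pandas follows. The three feature names and all numbers are made up for illustration and only stand in for a per-row frame with feature, feature_value and shap_values columns; sorting by absolute SHAP value is my assumption about the desired order, chosen because the waterfall plot orders bars that way.

```python
import pandas as pd

# Hypothetical stand-in for a per-row frame (made-up features and numbers).
df = pd.DataFrame({
    "feature": ["age", "income", "tenure"],
    "feature_value": [42.0, 55000.0, 3.5],
    "shap_values": [0.12, -0.30, 0.05],
})

# Merge name and value into one LIME-like label, e.g. "age = 42.0".
df["Feature"] = df["feature"] + " = " + df["feature_value"].astype(str)

# Put the largest absolute contribution first, as a waterfall plot does.
order = df["shap_values"].abs().sort_values(ascending=False).index
lime_like = df.loc[order, ["Feature", "shap_values"]].reset_index(drop=True)
print(lime_like)
```

    The same two lines (label construction and sorting) should carry over to the real frame as long as it has the same three columns.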
    

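    A RandomForest classifier is a bit special: its SHAP values carry an extra class dimension (sv.values has shape (n_samples, n_features, n_classes)), which the waterfall plot cannot take directly. That is why the code above slices out class 1 and wraps it in a new Explanation. When something fails with the new plots API, try feeding it an Explanation object.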
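
    Since the layout of classifier SHAP output depends on the shap version and on which call you use (older shap_values() calls return a list with one array per class, while the Explanation used above holds a single 3-D array), a small helper can hide the difference. This is only a sketch: select_class is a hypothetical name, not part of shap, and the arrays below are random stand-ins rather than real SHAP output.

```python
import numpy as np

def select_class(shap_output, class_idx):
    """Return an (n_samples, n_features) array for one class (hypothetical helper)."""
    if isinstance(shap_output, list):
        # List layout: one (n_samples, n_features) array per class.
        return np.asarray(shap_output[class_idx])
    arr = np.asarray(shap_output)
    if arr.ndim == 3:
        # Stacked layout: (n_samples, n_features, n_classes).
        return arr[:, :, class_idx]
    # Already 2-D, e.g. a single-output model.
    return arr

# Random stand-ins for the two layouts: 5 samples, 3 features, 2 classes.
rng = np.random.default_rng(0)
per_class = [rng.normal(size=(5, 3)) for _ in range(2)]
stacked = np.stack(per_class, axis=-1)

from_list = select_class(per_class, 1)
from_stack = select_class(stacked, 1)
print(from_list.shape, np.allclose(from_list, from_stack))
```

    With the real objects you would pass sv.values (or the list returned by shap_values()) and then build the Explanation from the 2-D result.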

    UPDATE

    To explain a single datapoint exp_id against a single background datapoint back_id (i.e. to answer the question "why does the prediction for exp_id differ from the prediction for back_id?"):

    back_id = 10
    exp_id = 20
    explainer = TreeExplainer(model, data=X.loc[[back_id]])
    sv = explainer(X.loc[[exp_id]])
    # data must be the explained row, so the plot labels show exp_id's feature values
    exp = Explanation(sv.values[:,:,1], sv.base_values[:,1], data=X.loc[[exp_id]].values, feature_names=X.columns)
    waterfall(exp[0])
    

    [image: waterfall plot for exp_id against back_id]
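
    To see what "one point explained against one background point" means numerically, here is a toy sketch with a linear model instead of the random forest. It does not use TreeExplainer. The weights and points are made up; for a linear model and a single background point, each feature's SHAP value is exactly w_i * (x_i - background_i), so the arithmetic can be checked by hand.

```python
import numpy as np

# Hypothetical linear model f(x) = w . x + c (a stand-in, not the forest above).
w = np.array([2.0, -1.0, 0.5])
c = 0.25

def f(x):
    return float(w @ x + c)

background = np.array([1.0, 2.0, 3.0])   # plays the role of back_id
explained = np.array([2.0, 0.0, 5.0])    # plays the role of exp_id

# With one background point the base value is the model output at that point.
base_value = f(background)

# Exact SHAP values for a linear model against a single background point.
shap_vals = w * (explained - background)

# Base value plus all contributions recovers the prediction being explained.
print(base_value, shap_vals, base_value + shap_vals.sum(), f(explained))
```

    The waterfall plot tells the same story for the forest: it starts at the prediction for the background point and adds one contribution per feature until it reaches the prediction for the explained point.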

    Finally, as you asked for everything in the suggested format:

    from shap.maskers import Independent
    from sklearn.model_selection import train_test_split
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    
    model = RandomForestClassifier(max_depth=5, n_estimators=100).fit(X_train, y_train)
    
    masker = Independent(X_train, max_samples=100)
    explainer = TreeExplainer(model, data=masker)
    bv = explainer.expected_value[1]
    sv = explainer(X_test, check_additivity=False)
    
    pd.DataFrame({
        'row_id': X_test.index.values.repeat(X_test.shape[1]),
        'feature': X_test.columns.to_list() * X_test.shape[0],
        'feature_value': X_test.values.flatten(),
        'base_value': bv,
        'shap_values': sv.values[:,:,1].flatten()
    })
    

    but I'd definitely not show this to my mom.
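
    If you build this long table more than once, the flattening can be wrapped in a function and sanity-checked. The sketch below is an assumption-laden illustration: shap_long_frame is a hypothetical helper, and the two-row frame and SHAP matrix are made up so that it runs without shap installed. It relies on the same row-major ordering as the snippet above (repeat the index, tile the columns, ravel the values).

```python
import numpy as np
import pandas as pd

def shap_long_frame(X, shap_2d, base_value):
    """One output row per (sample, feature) pair (hypothetical helper)."""
    n_rows, n_cols = X.shape
    assert shap_2d.shape == (n_rows, n_cols), "SHAP matrix must match X"
    return pd.DataFrame({
        "row_id": np.repeat(X.index.to_numpy(), n_cols),
        "feature": np.tile(X.columns.to_numpy(), n_rows),
        "feature_value": X.to_numpy().ravel(),
        "base_value": base_value,
        "shap_values": shap_2d.ravel(),
    })

# Made-up inputs: 2 samples, 2 features, already sliced to one class.
X_demo = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]}, index=[10, 11])
sv_demo = np.array([[0.1, -0.2], [0.3, 0.4]])
long_df = shap_long_frame(X_demo, sv_demo, base_value=0.5)

# Base value plus the summed contributions gives one reconstructed output per row.
pred = long_df.groupby("row_id")["shap_values"].sum() + 0.5
print(long_df)
print(pred)
```

    With real data you would pass X_test, sv.values[:,:,1] and bv; comparing pred with the model's class-1 probabilities is a quick way to spot a misaligned flatten.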