Tags: python, scikit-learn, random-forest, shap, xgbclassifier

SHAP value dimensions are different for RandomForest and XGB: why/how? Is there something one can do about this?


The SHAP values returned from TreeExplainer's .shap_values(some_data) have different dimensions for XGB than for random forest. I've tried looking into it, but can't seem to find why or how, or an explanation in any of Slundberg's (the SHAP author's) tutorials. So:

  • Is there a reason for this that I am missing?
  • Is there some flag that returns SHAP values for XGB per class, like for other models, that is not obvious or that I am missing?

Below is some sample code!

import xgboost.sklearn as xgb
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import shap

bc = load_breast_cancer()
cancer_df = pd.DataFrame(bc['data'], columns=bc['feature_names'])
cancer_df['target'] = bc['target']
cancer_df = cancer_df.iloc[0:50, :]
target = cancer_df['target']
cancer_df.drop(['target'], inplace=True, axis=1)

X_train, X_test, y_train, y_test = train_test_split(cancer_df, target, test_size=0.33, random_state=42)

xg = xgb.XGBClassifier()
xg.fit(X_train, y_train)
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

xg_pred = xg.predict(X_test)
rf_pred = rf.predict(X_test)

rf_explainer = shap.TreeExplainer(rf, X_train)
xg_explainer = shap.TreeExplainer(xg, X_train)

rf_vals = rf_explainer.shap_values(X_train)
xg_vals = xg_explainer.shap_values(X_train)

print('Random Forest')
print(type(rf_vals))
print(type(rf_vals[0]))
print(rf_vals[0].shape)
print(rf_vals[1].shape)

print('XGBoost')
print(type(xg_vals))
print(xg_vals.shape)

Output:

Random Forest
<class 'list'>
<class 'numpy.ndarray'>
(33, 30)
(33, 30)
XGBoost
<class 'numpy.ndarray'>
(33, 30)

Solution

  • For binary classification:

    • SHAP values for XGBClassifier (sklearn API) are raw log-odds values for the positive class (a single 2-D array)
    • SHAP values for RandomForestClassifier are probabilities for class 0 and class 1 (one 2-D array per class)

    DEMO

    from xgboost import XGBClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from shap import TreeExplainer
    from scipy.special import expit
    import numpy as np
    
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    
    xgb = XGBClassifier(
        max_depth=5, n_estimators=100, eval_metric="logloss", use_label_encoder=False
    ).fit(X_train, y_train)
    xgb_exp = TreeExplainer(xgb)
    xgb_sv = np.array(xgb_exp.shap_values(X_test))
    xgb_ev = np.array(xgb_exp.expected_value)
    
    print("Shape of XGB SHAP values:", xgb_sv.shape)
    
    rf = RandomForestClassifier(max_depth=5, n_estimators=100).fit(X_train, y_train)
    rf_exp = TreeExplainer(rf)
    rf_sv = np.array(rf_exp.shap_values(X_test))
    rf_ev = np.array(rf_exp.expected_value)
    
    print("Shape of RF SHAP values:", rf_sv.shape)
    

    Shape of XGB SHAP values: (143, 30)
    Shape of RF SHAP values: (2, 143, 30)
    

    Interpretation:

    • XGBoost (143, 30) dimensions:
      • 143: number of samples in the test set
      • 30: number of features
    • RF (2, 143, 30) dimensions:
      • 2: number of output classes
      • 143: number of samples in the test set
      • 30: number of features
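
    There is, to my knowledge, no flag that makes TreeExplainer return per-class arrays for XGB. For a binary model, though, the class-0 contributions in log-odds space are just the negation of the class-1 contributions, so you can assemble the RF-style layout yourself. A minimal sketch, reusing xgb_sv and xgb_ev from the demo above (caveat: the layout then matches RF's, but the units still differ, probabilities for RF vs log-odds for XGB):

    # For a binary model in log-odds space, pushing the prediction toward
    # class 1 pushes it away from class 0 by the same amount, so class-0
    # contributions are simply the negation of class-1 contributions.
    xgb_sv_per_class = np.stack([-xgb_sv, xgb_sv])  # shape (2, 143, 30), like RF
    xgb_ev_per_class = np.array([-xgb_ev, xgb_ev])  # per-class base values

    print("Shape of stacked XGB SHAP values:", xgb_sv_per_class.shape)  # (2, 143, 30)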

    Since the XGB SHAP values are raw log-odds, you can compare them to predicted probabilities (and thus classes) by adding the SHAP values to the base (expected) value and passing the sum through the sigmoid (expit). For the 0th datapoint in the test set:

    xgb_pred = expit(xgb_sv[0,:].sum() + xgb_ev)
    assert np.isclose(xgb_pred, xgb.predict_proba(X_test)[0,1])
    

    To compare RF SHAP values to the predicted probability for the 0th datapoint (no sigmoid needed, since the values are already probabilities):

    rf_pred = rf_sv[1,0,:].sum() + rf_ev[1]
    assert np.isclose(rf_pred, rf.predict_proba(X_test)[0,1])
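
    The same reconciliation works vectorized over the whole test set; a quick sanity check, assuming the variables from the demo above:

    # XGB: sigmoid of (base value + per-row sum of contributions), all rows at once
    assert np.allclose(expit(xgb_ev + xgb_sv.sum(axis=1)),
                       xgb.predict_proba(X_test)[:, 1])
    # RF: contributions are already probabilities, so no sigmoid is needed
    assert np.allclose(rf_ev[1] + rf_sv[1].sum(axis=1),
                       rf.predict_proba(X_test)[:, 1])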
    

    Note that this analysis applies to (i) the sklearn API and (ii) binary classification.