Tags: machine-learning, scikit-learn, shap

Why do I get a different expected_value when I include the training data in TreeExplainer?


Passing the training data to SHAP's TreeExplainer gives a different expected_value for a scikit-learn GradientBoostingRegressor.

Reproducible example (run in Google Colab):

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
import numpy as np
import shap

shap.__version__
# 0.37.0

X, y = make_regression(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

gbt = GradientBoostingRegressor(random_state=0)
gbt.fit(X_train, y_train)

# mean prediction:
mean_pred_gbt = np.mean(gbt.predict(X_train))
mean_pred_gbt
# -11.534353657511172

# explainer without data
gbt_explainer = shap.TreeExplainer(gbt)
gbt_explainer.expected_value
# array([-11.53435366])

np.isclose(mean_pred_gbt, gbt_explainer.expected_value)
# array([ True])

# explainer with training data
gbt_data_explainer = shap.TreeExplainer(model=gbt, data=X_train) # specifying feature_perturbation does not change the result
gbt_data_explainer.expected_value
# -23.564797322079635

So the expected value computed with the training data (gbt_data_explainer.expected_value) is quite different from the one computed without it (gbt_explainer.expected_value).

Both approaches are additive and consistent when used with the (obviously different) respective shap_values:

np.abs(gbt_explainer.expected_value + gbt_explainer.shap_values(X_train).sum(1) - gbt.predict(X_train)).max() < 1e-4
# True

np.abs(gbt_data_explainer.expected_value + gbt_data_explainer.shap_values(X_train).sum(1) - gbt.predict(X_train)).max() < 1e-4
# True

but I wonder why they do not produce the same expected_value, and why gbt_data_explainer.expected_value is so far from the mean prediction.

What am I missing here?


Solution

  • Apparently shap subsamples the background data down to 100 rows when data is passed, then runs those rows through the trees to reset the sample counts for each node. So the -23.5... being reported is the average model output over those 100 rows.

    The data is passed to an Independent masker, which does the subsampling:
    https://github.com/slundberg/shap/blob/v0.37.0/shap/explainers/_tree.py#L94
    https://github.com/slundberg/shap/blob/v0.37.0/shap/explainers/_explainer.py#L68
    https://github.com/slundberg/shap/blob/v0.37.0/shap/maskers/_tabular.py#L216
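
    A shift of this size from subsampling is plausible here: the model's outputs have a large spread, and the standard error of a 100-row background mean is std/sqrt(100). A minimal numpy sketch, where the scale of 150 is an assumed stand-in for this model's prediction spread (not measured from it):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    # stand-in for the model's predictions: mean near -11.5, large (assumed) spread
    preds = rng.normal(loc=-11.5, scale=150.0, size=800)

    full_mean = preds.mean()  # close to -11.5
    # mean over a random 100-row subsample, as the masker's subsampling would take
    sub_mean = rng.choice(preds, size=100, replace=False).mean()
    # standard error of a 100-row subsample mean: with std ~150 this is ~15,
    # so a subsample mean a dozen units away from the full mean is unremarkable
    se = preds.std() / np.sqrt(100)
    ```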

    Running

    from shap import maskers
    
    another_gbt_explainer = shap.TreeExplainer(
        gbt,
        data=maskers.Independent(X_train, max_samples=800),
        feature_perturbation="tree_path_dependent"
    )
    another_gbt_explainer.expected_value
    

    gets back to

    -11.534353657511172
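
    Equivalently, one can keep the default interventional perturbation and simply make max_samples cover the whole background set, so no rows are dropped and the expected value is the mean model output over all of X_train. A self-contained sketch re-fitting the same model as in the question (behaviour checked against shap's documented API, not guaranteed across versions):

    ```python
    from sklearn.datasets import make_regression
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import GradientBoostingRegressor
    import numpy as np
    import shap
    from shap import maskers

    # same setup as in the question
    X, y = make_regression(n_samples=1000, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    gbt = GradientBoostingRegressor(random_state=0)
    gbt.fit(X_train, y_train)

    # max_samples >= len(X_train) means the Independent masker keeps every row,
    # so the interventional expected value is the mean prediction on X_train
    explainer = shap.TreeExplainer(
        gbt, data=maskers.Independent(X_train, max_samples=len(X_train))
    )
    print(np.isclose(explainer.expected_value, gbt.predict(X_train).mean()))
    ```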