Including the training data in SHAP TreeExplainer gives different expected_value
in scikit-learn GBT Regressor.
Reproducible example (run in Google Colab):
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
import numpy as np
import shap
shap.__version__
# 0.37.0
X, y = make_regression(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
gbt = GradientBoostingRegressor(random_state=0)
gbt.fit(X_train, y_train)
# mean prediction:
mean_pred_gbt = np.mean(gbt.predict(X_train))
mean_pred_gbt
# -11.534353657511172
# explainer without data
gbt_explainer = shap.TreeExplainer(gbt)
gbt_explainer.expected_value
# array([-11.53435366])
np.isclose(mean_pred_gbt, gbt_explainer.expected_value)
# array([ True])
# explainer with training data
gbt_data_explainer = shap.TreeExplainer(model=gbt, data=X_train) # specifying feature_perturbation does not change the result
gbt_data_explainer.expected_value
# -23.564797322079635
So, the expected value when including the training data gbt_data_explainer.expected_value
is quite different from the one calculated without supplying the data (gbt_explainer.expected_value
).
Both approaches are additive and consistent when used with the (obviously different) respective shap_values
:
np.abs(gbt_explainer.expected_value + gbt_explainer.shap_values(X_train).sum(1) - gbt.predict(X_train)).max() < 1e-4
# True
np.abs(gbt_data_explainer.expected_value + gbt_data_explainer.shap_values(X_train).sum(1) - gbt.predict(X_train)).max() < 1e-4
# True
but I wonder why they do not provide the same expected_value
, and why gbt_data_explainer.expected_value
is so different from the mean value of predictions.
What am I missing here?
Apparently shap
subsets to 100 rows when data
is passed, then runs those rows through the trees to reset the sample counts for each node. So the -23.5...
being reported is the average model output for those 100 rows.
The data
is passed to an Independent
masker, which does the subsampling:
https://github.com/slundberg/shap/blob/v0.37.0/shap/explainers/_tree.py#L94
https://github.com/slundberg/shap/blob/v0.37.0/shap/explainers/_explainer.py#L68
https://github.com/slundberg/shap/blob/v0.37.0/shap/maskers/_tabular.py#L216
Running
from shap import maskers
another_gbt_explainer = shap.TreeExplainer(
gbt,
data=maskers.Independent(X_train, max_samples=800),
feature_perturbation="tree_path_dependent"
)
another_gbt_explainer.expected_value
gets back to
-11.534353657511172