I'm trying to create a force_plot for my Random Forest model that has two classes (1 and 2), but I am a bit confused about the parameters for the force_plot.
There are two parameter combinations I can provide to force_plot:
shap.force_plot(explainer.expected_value[0], shap_values[0], chosen_instance, show=True, matplotlib=True)
shap.force_plot(explainer.expected_value[1], shap_values[1], chosen_instance, show=True, matplotlib=True)
So my questions are:
When creating the force_plot, I must supply an expected_value. For my model I have two expected values: [0.20826239, 0.79173761]. How do I know which one to use? My understanding is that the expected value is the average prediction of my model on the training data. Are there two values because I have both class_1 and class_2? So for class_1 the average prediction is 0.20826239, and for class_2 it is 0.79173761?
The next parameter is shap_values. For my chosen instance:
index    B    G    R    Prediction
113833   107  119  237  2
I get the following SHAP values:
[array([[ 0.01705462, -0.01812987, 0.23416978]]),
array([[-0.01705462, 0.01812987, -0.23416978]])]
I don't quite understand why I get two sets of SHAP values. Is one for class_1 and one for class_2? I have been trying to compare the attached images against both sets of SHAP values and expected values, but I can't really explain what is going on in terms of the prediction.
Let's try a reproducible example:
import numpy as np  # used below to aggregate the SHAP values
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from shap import TreeExplainer
from shap.maskers import Independent
from scipy.special import expit, logit

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# Note: the forest itself is not seeded, so the exact numbers below
# may vary slightly between runs.
model = RandomForestClassifier(max_depth=5, n_estimators=100).fit(X_train, y_train)
Then, your SHAP expected values are:
masker = Independent(data=X_train)
explainer = TreeExplainer(model, data=masker)
ev = explainer.expected_value
ev
array([0.35468973, 0.64531027])
This is what your model would predict on average, given the background dataset (the one fed to the explainer above):
model.predict_proba(masker.data).mean(0)
array([0.35468973, 0.64531027])
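As a side note: recent shap versions subsample the background data inside the Independent masker (max_samples defaults to 100; this may differ in your version), which is why the average above is taken over masker.data rather than the full X_train. A quick check:
print(X_train.shape)      # full training set
print(masker.data.shape)  # background rows the explainer actually uses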
Then, if you have a datapoint of interest:
data_to_explain = X_train[[0]]
model.predict_proba(data_to_explain)
array([[0.00470234, 0.99529766]])
You can achieve exactly the same with SHAP values:
sv = explainer.shap_values(data_to_explain)
np.array(sv).sum(2).ravel()
array([-0.34998739, 0.34998739])
Note that they are symmetrical: whatever increases the chances towards class 1 decreases the chances for class 0 by the same amount.
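You can verify this symmetry directly (a quick check, assuming the list-of-two-arrays output shown above):
np.allclose(sv[0], -sv[1])
True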
With the base values and SHAP values, the probabilities (i.e. the chances for the data point to end up in class 0 or class 1) are:
ev + np.array(sv).sum(2).ravel()
array([0.00470234, 0.99529766])
Note that this is the same as the model's predictions.
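To tie this back to the original question: pick the class you want to explain and use the matching pair of base value and SHAP values. For class 1 that is expected_value[1] together with shap_values[1]. A minimal sketch using the objects defined above (not taken from your code):
import shap
# Force plot for class 1: starts at the base rate ev[1] and the SHAP
# values push it to the final probability 0.99529766
shap.force_plot(ev[1], sv[1][0], data_to_explain[0], matplotlib=True, show=True)
The class-0 plot (ev[0] with sv[0]) is the mirror image of this one, since the contributions are symmetrical.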