I'm using the Python shap package to better understand my machine learning model. (From the documentation: "SHAP (SHapley Additive exPlanations) is a game theoretic approach to explain the output of any machine learning model.") Below is a small reproducible example of the error I'm getting:
Python 3.8.1 (tags/v3.8.1:1b293b6, Dec 18 2019, 23:11:46) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import shap
>>> shap.__version__
'0.37.0'
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.linear_model import LogisticRegression
>>>
>>> iris = shap.datasets.iris()
>>> X_train, X_test, y_train, y_test = train_test_split(*iris, random_state=1)
>>> model = LogisticRegression(penalty='none', max_iter = 1000, random_state=1)
>>> model.fit(X_train, y_train)
>>>
>>> explainer = shap.Explainer(model, data=X_train, masker=shap.maskers.Impute(),
... feature_names=X_train.columns, algorithm="linear")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: __init__() missing 1 required positional argument: 'data'
Based on the stack trace, the error appears to occur in the top-level Explainer call, not within the call to Impute(). I have also tried leaving out the data= part, and this throws the same error. This seems very strange to me, since neither the Explainer object's documentation nor its source code mentions any data argument (I verified it's from the same package version I'm using):
__init__(model, masker=None, link=CPUDispatcher(<function identity>), algorithm='auto', output_names=None, feature_names=None, **kwargs)
Any ideas? Is this a bug, or am I missing something obvious?
The __init__ signature of Impute is:
def __init__(self, data, method="linear")
Hence your error. So, instead of:
explainer = shap.Explainer(model, data=X_train, masker=shap.maskers.Impute(),
feature_names=X_train.columns, algorithm="linear")
you should feed X_train to the masker:
explainer = shap.Explainer(model, masker=shap.maskers.Impute(data=X_train),
feature_names=X_train.columns, algorithm="linear")
because it's the masker that takes care of the data in the new API.
Unfortunately, even this won't work, because the Impute masker implies feature_perturbation="correlation_dependent", and that code path doesn't seem to be ready yet. The Independent masker, however, works well:
import shap
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
iris = shap.datasets.iris()
X_train, X_test, y_train, y_test = train_test_split(*iris, random_state=1)
model = LogisticRegression(penalty="none", max_iter=1000, random_state=1)
model.fit(X_train, y_train)
masker = shap.maskers.Independent(data=X_test)
explainer = shap.Explainer(
    model, masker=masker, feature_names=X_train.columns, algorithm="linear"
)
sv = explainer(X_test)
sv.base_values[0]
array([-5.0060995 , 13.03460398, -8.02850448])
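As an aside: since it's the masker that owns the background data in the new API, my reading of the Explainer source suggests you can also pass the raw background DataFrame as the masker argument and let shap wrap it in a default Independent masker for you. Treat this as an assumption rather than documented behavior and double-check it on your shap version:

# Sketch only: relies on Explainer wrapping a plain DataFrame into
# shap.maskers.Independent behind the scenes.
explainer = shap.Explainer(
    model, X_test, feature_names=X_train.columns, algorithm="linear"
)
sv = explainer(X_test)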
If you happen to have missing data in your dataset, you can impute it yourself, according to your preferred imputation strategy, and feed the imputed data to Independent.
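For example, here is a minimal sketch using sklearn's SimpleImputer. The mean strategy and the assumption that your own X_train/X_test contain NaNs are mine, not part of the example above (the iris data has no missing values); in a real pipeline you would also fit the model on the imputed training data:

import pandas as pd
from sklearn.impute import SimpleImputer

# Impute missing values yourself (mean imputation, purely as an example),
# then hand the completed data to the Independent masker.
imputer = SimpleImputer(strategy="mean")
X_train_imp = pd.DataFrame(imputer.fit_transform(X_train), columns=X_train.columns)
X_test_imp = pd.DataFrame(imputer.transform(X_test), columns=X_test.columns)

masker = shap.maskers.Independent(data=X_train_imp)
explainer = shap.Explainer(
    model, masker=masker, feature_names=X_train_imp.columns, algorithm="linear"
)
sv = explainer(X_test_imp)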