
shap.Explainer constructor error asking for undocumented positional argument


I'm using the Python shap package to better understand my machine learning model. (From the documentation: "SHAP (SHapley Additive exPlanations) is a game theoretic approach to explain the output of any machine learning model.") Below is a small reproducible example of the error I'm getting:

Python 3.8.1 (tags/v3.8.1:1b293b6, Dec 18 2019, 23:11:46) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import shap
>>> shap.__version__
'0.37.0'
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.linear_model import LogisticRegression
>>> 
>>> iris = shap.datasets.iris()
>>> X_train, X_test, y_train, y_test = train_test_split(*iris, random_state=1)
>>> model = LogisticRegression(penalty='none', max_iter = 1000, random_state=1)
>>> model.fit(X_train, y_train)
>>> 
>>> explainer = shap.Explainer(model, data=X_train, masker=shap.maskers.Impute(),
...                            feature_names=X_train.columns, algorithm="linear")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: __init__() missing 1 required positional argument: 'data'

Based on the stack trace, the error appears to occur in the top-level function call, not within the call to Impute(). I have also tried leaving out the data= part, and this throws the same error. This seems very strange to me, since neither the Explainer object's documentation nor its source code mentions any data argument (I verified both against the package version I'm using):

__init__(model, masker=None, link=CPUDispatcher(<function identity>), algorithm='auto', output_names=None, feature_names=None, **kwargs)

Any ideas? Is this a bug, or am I missing something obvious?


Solution

  • The init signature of Impute is:

    def __init__(self, data, method="linear")
    

    Your code calls Impute() without its required data argument, hence your error. So, instead of:

    explainer = shap.Explainer(model, data=X_train, masker=shap.maskers.Impute(),
                               feature_names=X_train.columns, algorithm="linear")
    

    you should pass X_train to the masker:

    explainer = shap.Explainer(model, masker=shap.maskers.Impute(data=X_train),
                               feature_names=X_train.columns, algorithm="linear")
    

    because in the new API it is the masker that takes care of the data.
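The following is a minimal sketch, without shap, of why the traceback points at the top-level statement rather than at Impute(). FakeImpute and fake_explainer are hypothetical stand-ins, not shap classes; only the signature of FakeImpute.__init__ mirrors the Impute signature quoted above. Python evaluates argument expressions before it makes the outer call, and a missing-argument TypeError is raised while binding arguments, before any __init__ frame exists, so the only frame left in the traceback is the calling line.

```python
class FakeImpute:
    """Hypothetical stand-in that mirrors the quoted Impute signature."""

    def __init__(self, data, method="linear"):
        self.data = data
        self.method = method


def fake_explainer(model, masker=None, **kwargs):
    """Hypothetical stand-in for an explainer that accepts extra keyword arguments."""
    return {"model": model, "masker": masker, "kwargs": kwargs}


def build_wrong():
    # FakeImpute() is evaluated as an argument expression, so the TypeError
    # is raised before fake_explainer is ever entered. The data= keyword
    # passed to fake_explainer never reaches FakeImpute.
    return fake_explainer("model", data=[1, 2, 3], masker=FakeImpute())


def build_right():
    # The data goes to the masker, not to the explainer.
    return fake_explainer("model", masker=FakeImpute(data=[1, 2, 3]))


try:
    build_wrong()
except TypeError as exc:
    error_message = str(exc)

explainer = build_right()
print(error_message)
print(explainer["masker"].data)
```

The exact prefix of the message depends on the Python version (newer versions include the class name), but it always names the missing 'data' argument.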

    Unfortunately, even this won't work, because the Impute masker implies feature_perturbation = "correlation_dependent", which doesn't seem to be ready yet.

    The Independent masker, however, works well:

    import shap
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    
    iris = shap.datasets.iris()
    X_train, X_test, y_train, y_test = train_test_split(*iris, random_state=1)
    model = LogisticRegression(penalty="none", max_iter=1000, random_state=1)
    model.fit(X_train, y_train)
    
    masker = shap.maskers.Independent(data=X_test)
    
    explainer = shap.Explainer(
        model, masker=masker, feature_names=X_train.columns, algorithm="linear"
    )
    
    sv = explainer(X_test)
    sv.base_values[0]
    

    array([-5.0060995 , 13.03460398, -8.02850448])
    

    If you happen to have missing data in your dataset, you can impute it yourself, according to your preferred imputation strategy, and feed the result to Independent.