Tags: python, machine-learning, random-forest, pca, shap

Use SHAP values to explain a PCA-selected dataset


I'm using a RandomForestClassifier model that I'd like to explain with Shapley values. The data (which contains 150 features) was first run through a PCA selector, which reduced it to 3 new features, and this reduced data was then used to fit the RandomForestClassifier. The fitted model is passed to shap.Explainer(). The problem is that I'd like SHAP to explain the model in terms of the original 150 features, not the 3 PCA components, so I called shap.Explainer() with the original data:

import shap
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

#the selector: reduce the 150 original features to 3 PCA components
fs_all_pca = PCA(n_components=3).fit(X)
X_all_pca = fs_all_pca.transform(X)

#the model: fit on the 3 PCA components
model = RandomForestClassifier(max_depth=5, min_samples_split=4, n_estimators=200,
                               min_samples_leaf=3, class_weight=class_weights)
model.fit(X_all_pca, y)

#explain with shap, called with the original 150-feature data
explainer = shap.Explainer(model.predict, X)
shap_values = explainer(X)

However, I get this error:

raise ValueError(
ValueError: X has 150 features, but RandomForestClassifier is expecting 3 features as input.

Is there a way to run SHAP on the PCA + model combination using the original data, i.e. before it passes through the PCA selection stage?


Solution

  • You're fitting your model on X_all_pca, which has 3 features:

    fs_all_pca = PCA(n_components=3).fit(X)
    X_all_pca = fs_all_pca.transform(X)
    model.fit(X_all_pca, y)
    

    However, when you build the explainer you feed it all 150 original features:

    explainer = shap.Explainer(model.predict, X)
    

    Hence your error message.

    Instead, it should be more or less:

    explainer = shap.Explainer(model.predict, X_all_pca)
    
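    For completeness, you would then compute the values in PCA space (a minimal sketch, reusing the objects above); note that the resulting attributions refer to the 3 PCA components, not the original 150 features:

    shap_values = explainer(X_all_pca)  # one attribution per PCA component per row
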

    If for some reason you want to run the analysis on all of the original features (why do PCA then?), wrap the PCA and the model in a pipeline and feed the pipeline through KernelExplainer, as sketched below.
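
    A minimal sketch of that approach, assuming the same X, y, and class_weights as in the question. KernelExplainer is model-agnostic, so it can attribute the pipeline's output to the original 150 features; it is slow, so the background data is summarized with k-means and only a subset of rows is explained here:

    import shap
    from sklearn.decomposition import PCA
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.pipeline import Pipeline

    # wrap PCA + classifier so SHAP sees one function of the 150 original features
    pipe = Pipeline([
        ("pca", PCA(n_components=3)),
        ("rf", RandomForestClassifier(max_depth=5, min_samples_split=4,
                                      n_estimators=200, min_samples_leaf=3,
                                      class_weight=class_weights)),
    ])
    pipe.fit(X, y)

    # summarize the background data to keep KernelExplainer tractable
    background = shap.kmeans(X, 10)
    explainer = shap.KernelExplainer(pipe.predict_proba, background)

    # KernelExplainer cost grows quickly with rows; explain a subset first
    shap_values = explainer.shap_values(X[:100])

    The returned SHAP values now have one column per original feature, so plots such as shap.summary_plot will show the 150 input features instead of the 3 components.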