Tags: python, machine-learning, random-forest, pca, shap

Use SHAP values to explain a PCA-selected dataset


I'm using a RandomForestClassifier model that I'd like to explain with Shapley values. The data (which contains 150 features) was first run through a PCA selector, which reduced it to 3 new features, and this reduced data was then used to fit the RandomForestClassifier. The fitted model is passed to shap.Explainer(). The problem is that I'd like SHAP to explain the model in terms of the original 150 features, not the 3 PCA components, so I called shap.Explainer() with the original data:

import shap
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

#the selector: reduce the 150 original features to 3 PCA components
fs_all_pca = PCA(n_components=3).fit(X)
X_all_pca = fs_all_pca.transform(X)

#the model: fit on the 3 PCA components
model = RandomForestClassifier(max_depth=5, min_samples_split=4, n_estimators=200,
                               min_samples_leaf=3, class_weight=class_weights)
model.fit(X_all_pca, y)

#explain with shap, called with the original 150-feature data
explainer = shap.Explainer(model.predict, X)
shap_values = explainer(X)

However, I get this error:

raise ValueError(
ValueError: X has 150 features, but RandomForestClassifier is expecting 3 features as input.

Is there a way to run SHAP on the PCA + model combination using the original data, i.e. before it passes through the PCA selection stage?


Solution

  • You're fitting your model on X_all_pca, which has 3 features:

    fs_all_pca = PCA(n_components=3).fit(X)
    X_all_pca = fs_all_pca.transform(X)
    model.fit(X_all_pca, y)
    

    However, when you build the explainer you feed it all 150 original features:

    explainer = shap.Explainer(model.predict, X)
    

    Hence your error message.

    Instead, it should be more or less:

    explainer = shap.Explainer(model.predict, X_all_pca)
    
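    For completeness, you would then compute the values in PCA space (a minimal sketch, reusing the objects above); note that the resulting attributions refer to the 3 PCA components, not the original 150 features:

    shap_values = explainer(X_all_pca)  # one attribution per PCA component per row
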

    If for some reason you want to run the analysis on all of the original features (why do PCA then?), wrap the PCA and the model in a pipeline and feed the pipeline through KernelExplainer, as sketched below.
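
    A minimal sketch of that approach, assuming the same X, y, and class_weights as in the question. KernelExplainer is model-agnostic, so it can attribute the pipeline's output to the original 150 features; it is slow, so the background data is summarized with k-means and only a subset of rows is explained here:

    import shap
    from sklearn.decomposition import PCA
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.pipeline import Pipeline

    # wrap PCA + classifier so SHAP sees one function of the 150 original features
    pipe = Pipeline([
        ("pca", PCA(n_components=3)),
        ("rf", RandomForestClassifier(max_depth=5, min_samples_split=4,
                                      n_estimators=200, min_samples_leaf=3,
                                      class_weight=class_weights)),
    ])
    pipe.fit(X, y)

    # summarize the background data to keep KernelExplainer tractable
    background = shap.kmeans(X, 10)
    explainer = shap.KernelExplainer(pipe.predict_proba, background)

    # KernelExplainer cost grows quickly with rows; explain a subset first
    shap_values = explainer.shap_values(X[:100])

    The returned SHAP values now have one column per original feature, so plots such as shap.summary_plot will show the 150 input features instead of the 3 components.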