Tags: python, machine-learning, scikit-learn, pipeline

Permutation importance using a Pipeline in scikit-learn


I am using the exact example from the scikit-learn documentation that compares permutation_importance with the tree-based feature_importances_.

As you can see, a Pipeline is used:

from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# `preprocessing` is the ColumnTransformer defined earlier in the example
rf = Pipeline([
    ('preprocess', preprocessing),
    ('classifier', RandomForestClassifier(random_state=42))
])
rf.fit(X_train, y_train)

permutation_importance:

Now, when you fit a Pipeline, it will "fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator" (quoting the Pipeline.fit documentation).
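In other words, the single rf.fit call above is roughly equivalent to the following sketch (treating preprocessing as the ColumnTransformer defined earlier in the example; this is a simplification, not Pipeline's actual source):

from sklearn.ensemble import RandomForestClassifier

# A simplified sketch of what rf.fit(X_train, y_train) does internally
Xt_train = preprocessing.fit_transform(X_train, y_train)  # fit each transform, then apply it
classifier = RandomForestClassifier(random_state=42)
classifier.fit(Xt_train, y_train)                         # fit the final estimator on the transformed data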

Later in the example, they used the permutation_importance on the fitted model:

from sklearn.inspection import permutation_importance

result = permutation_importance(rf, X_test, y_test, n_repeats=10,
                                random_state=42, n_jobs=2)

Problem: What I don't understand is that the features in the result are still the original non-transformed features. Why is this the case? Is this working correctly? What is the purpose of the Pipeline then?

[Plot: permutation importances, labeled with the original, untransformed feature names]

tree feature_importances_: In the same example, when they use feature_importances_, the results are reported for the transformed features:

# impurity-based importances, reported per transformed feature
tree_feature_importances = (
    rf.named_steps['classifier'].feature_importances_)

[Plot: impurity-based feature importances, labeled with the transformed feature names]

I could obviously transform my features myself and then run permutation_importance on the classifier, but the steps presented in the example seem intentional, so there should be a reason why permutation_importance does not transform the features.


Solution

  • This is the expected behavior. Permutation importance works by shuffling one column of the input data at a time and re-scoring the pipeline (or the bare model, if that is what you pass in) on the shuffled data; a minimal sketch of this loop is given below. In fact, if you want to understand how the original input data affects the model, then you should apply it to the pipeline.

    If you are interested in the importance of each of the additional features generated by your preprocessing steps, then you should generate the preprocessed dataset with column names and apply permutation_importance to the model directly instead of to the pipeline, as shown in the second sketch below.

    In most cases, people are not interested in the impact of the secondary features that the pipeline generates. That is why the example wraps the preprocessing and modeling steps in a single pipeline here.
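For intuition, here is a minimal sketch of the shuffling loop mentioned above. It is not scikit-learn's actual implementation, just the idea: score the model once on the intact data, then re-score it with each column permuted in turn. The function name is mine, and it assumes X is a pandas DataFrame.

import numpy as np

# A minimal sketch of the permutation importance idea (not sklearn's implementation).
# `model` can be the full pipeline or a bare estimator.
def permutation_importance_sketch(model, X, y, n_repeats=10, random_state=42):
    rng = np.random.default_rng(random_state)
    baseline = model.score(X, y)                      # score on the unshuffled data
    importances = np.zeros((X.shape[1], n_repeats))
    for j in range(X.shape[1]):                       # permute one column at a time
        for r in range(n_repeats):
            X_perm = X.copy()
            X_perm.iloc[:, j] = rng.permutation(X_perm.iloc[:, j].to_numpy())
            importances[j, r] = baseline - model.score(X_perm, y)
    return importances                                # drop in score per feature, per repeat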
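And here is a hedged sketch of the second approach: pulling the fitted preprocessing step out of the pipeline, transforming the test set yourself, and running permutation_importance against the bare classifier. The get_feature_names_out() call assumes scikit-learn >= 1.0; older releases exposed names differently on individual transformers.

from sklearn.inspection import permutation_importance

preprocess = rf.named_steps['preprocess']    # the fitted ColumnTransformer
classifier = rf.named_steps['classifier']    # the fitted RandomForestClassifier

Xt_test = preprocess.transform(X_test)
if hasattr(Xt_test, "toarray"):              # densify if a one-hot encoder returned a sparse matrix
    Xt_test = Xt_test.toarray()

result_transformed = permutation_importance(
    classifier, Xt_test, y_test,
    n_repeats=10, random_state=42, n_jobs=2)

feature_names = preprocess.get_feature_names_out()   # names of the transformed columns

Here result_transformed is indexed by the transformed columns, so the importances line up with feature_names rather than with the original input features.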