python scikit-learn xgbregressor

Extract feature names from XGBRegressor used in scikit-learn pipeline with OneHotEncoded categorical features


I have a dataset with a few numerical and a few categorical features. After calling fit on the XGBRegressor, I want to check the feature importance. For this, I want to map the feature importance scores to the feature names. The regressor is used in a scikit-learn pipeline:

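from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBRegressor

# One-hot encode the categorical columns; categories unseen during fit are ignored
categorical_encoder = Pipeline(
    steps=[("encoder", OneHotEncoder(handle_unknown="ignore"))]
)
# Encode the categorical columns and pass the remaining (numerical) columns through unchanged
encoder = ColumnTransformer(
    transformers=[
        ("categories", categorical_encoder, ["cat_feature1", "cat_feature2", "cat_feature3"])
    ],
    remainder="passthrough"
)
pipeline = Pipeline([
    ("encoder", encoder),
    ("regressor", XGBRegressor())
])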

pipeline['regressor'].get_booster().get_fscore() returns a dictionary with feature names f0, f2, f7, ...

pipeline['encoder'].named_transformers_['categories']['encoder'].get_feature_names_out() returns only the feature names of the one-hot encoded categorical variables.

Can I somehow get the full feature list which has been created in the pipeline and map it to the feature importance scores? I could not figure it out by myself.


Solution

  • All the composition objects (Pipeline and ColumnTransformer) support get_feature_names_out too, so

    pipeline['encoder'].get_feature_names_out()
    

    should return the full list of output feature names, in the column order the regressor sees. More generally, when you have a pipeline whose last step is a model, slicing off that last step works well:

    pipeline[:-1].get_feature_names_out()
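
To go from there to the mapping asked for in the question, here is a minimal sketch, assuming the regressor was fitted on an unnamed array so that the fN keys from get_fscore() are positions in the name list (f0 is the first output column, and so on). The helper map_fscore_to_names and the sample names and scores are made up for illustration; in practice the two inputs would come from pipeline[:-1].get_feature_names_out() and pipeline['regressor'].get_booster().get_fscore().

```python
def map_fscore_to_names(fscore, feature_names):
    """Translate positional keys such as "f7" into the matching feature names."""
    named = {}
    for key, score in fscore.items():
        index = int(key[1:])  # "f7" -> 7, the column position after encoding
        named[feature_names[index]] = score
    return named


# Hypothetical stand-in for pipeline[:-1].get_feature_names_out()
feature_names = [
    "categories__cat_feature1_a",
    "categories__cat_feature1_b",
    "categories__cat_feature2_x",
    "categories__cat_feature2_y",
    "remainder__num_feature1",
]
# Hypothetical stand-in for pipeline['regressor'].get_booster().get_fscore()
fscore = {"f0": 12, "f2": 5, "f4": 30}

named_scores = map_fscore_to_names(fscore, feature_names)

# Print the features from most to least important
ranked = sorted(named_scores.items(), key=lambda item: item[1], reverse=True)
for name, score in ranked:
    print(name, score)
```

Features that the booster never split on do not appear in get_fscore() at all, which is why the keys can skip numbers (f0, f2, f7, ...). If a score for every column is preferred, pipeline['regressor'].feature_importances_ should have one entry per output column and can be zipped with the name list directly.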