I have a dataset with a few numerical and a few categorical features. After calling fit on the XGBRegressor, I want to check the feature importance. For this, I want to map the feature importance scores to the feature names. The regressor is used in a scikit-learn pipeline:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBRegressor

categorical_encoder = Pipeline(
    steps=[("encoder", OneHotEncoder(handle_unknown="ignore"))]
)
encoder = ColumnTransformer(
    transformers=[
        ("categories", categorical_encoder,
         ["cat_feature1", "cat_feature2", "cat_feature3"])
    ],
    remainder="passthrough",
)
pipeline = Pipeline([
    ("encoder", encoder),
    ("regressor", XGBRegressor()),
])
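For reference, here is a toy version of the data and the fit (the numeric column names and category levels are just placeholders):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100
X = pd.DataFrame({
    "cat_feature1": rng.choice(["a", "b"], n),
    "cat_feature2": rng.choice(["x", "y", "z"], n),
    "cat_feature3": rng.choice(["p", "q"], n),
    "num_feature1": rng.normal(size=n),  # placeholder numeric column
    "num_feature2": rng.normal(size=n),  # placeholder numeric column
})
y = rng.normal(size=n)

pipeline.fit(X, y)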
pipeline['regressor'].get_booster().get_fscore()
returns a dictionary with feature names `f0`, `f2`, `f7`, ...
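For example (the scores here are made up):

fscores = pipeline['regressor'].get_booster().get_fscore()
# e.g. {'f0': 31.0, 'f2': 12.0, 'f7': 5.0}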
pipeline['encoder'].named_transformers_['categories']['encoder'].get_feature_names_out()
returns the feature names of the one-hot-encoded categorical variables.
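With the placeholder category levels above, that output would look something like the following; note it covers only the encoded categorical columns, not the passthrough numerical ones:

pipeline['encoder'].named_transformers_['categories']['encoder'].get_feature_names_out()
# e.g. array(['cat_feature1_a', 'cat_feature1_b', 'cat_feature2_x',
#             'cat_feature2_y', 'cat_feature2_z', 'cat_feature3_p',
#             'cat_feature3_q'], dtype=object)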
Can I somehow get the full feature list that has been created in the pipeline and map it to the feature importance scores? I could not figure it out by myself.
All the composition objects (`Pipeline` and `ColumnTransformer`) support `get_feature_names_out` too, so

pipeline['encoder'].get_feature_names_out()

should work fine. More generally, when you have a pipeline whose last step is a model, slicing works well:
pipeline[:-1].get_feature_names_out()
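Putting the two together, here is a minimal sketch of the mapping (assuming the pipeline has been fit): the booster names its features f0, f1, ... by position, so the integer after the f indexes directly into the transformed feature names.

feature_names = pipeline[:-1].get_feature_names_out()
fscores = pipeline['regressor'].get_booster().get_fscore()

# 'f<i>' refers to column i of the transformed matrix, so strip the
# leading 'f' and use the remainder as an index into feature_names
importance = {feature_names[int(key[1:])]: score
              for key, score in fscores.items()}

Two things to keep in mind: ColumnTransformer prefixes each name with the transformer that produced it (e.g. categories__cat_feature1_a, remainder__num_feature1), and features never used in any split simply don't appear in get_fscore(). Alternatively, pipeline['regressor'].feature_importances_ is aligned positionally with the same feature_names, though its scores reflect the estimator's configured importance_type rather than raw split counts.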