Tags: python, pandas, scikit-learn, one-hot-encoding

How to get feature importances with column names after RandomizedSearchCV on one-hot encoded data?


I wrote the following code block. After finding the best estimator, I want to get the feature importances of the model, but I couldn't figure out how to match them up with the column names.

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.multioutput import MultiOutputClassifier
from sklearn.model_selection import RandomizedSearchCV
import xgboost as xgb

scaler = StandardScaler()
ohe = OneHotEncoder(categories=unique_list, sparse=False)

col_transformers = ColumnTransformer([
                          ("scaler_onestep", scaler, numerical_columns),
                          ("ohe_onestep", ohe, categorical_columns)])


param_grid = {
        'XGB__estimator__max_depth': [3, 5, 7, 10],
        'XGB__estimator__learning_rate': [0.01, 0.1],
        'XGB__estimator__n_estimators': [100]}

model = MultiOutputClassifier(xgb.XGBClassifier(objective="binary:logistic"))

#Define a pipeline
pipeline = Pipeline([("preprocessing", col_transformers), ("XGB", model)])

rs_clf = RandomizedSearchCV(pipeline, param_grid, n_iter=3,
                            n_jobs=-1, verbose=2, cv=2, scoring="accuracy", refit=True, random_state=42)

rs_clf.fit(X, y)

This gives me the feature importances for the first label:

rs_clf.best_estimator_.named_steps["XGB"].estimators_[0].feature_importances_
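
MultiOutputClassifier fits one booster per output, so the same attribute can be read from every entry of estimators_ (a minimal sketch, assuming the fitted rs_clf above):

xgb_step = rs_clf.best_estimator_.named_steps["XGB"]
for i, booster in enumerate(xgb_step.estimators_):
    print(i, booster.feature_importances_)   # one importance vector per label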

This gives me the categories:

rs_clf.best_estimator_.named_steps["preprocessing"].transformers[1][1].categories

The result has 389 columns while X has only 279, so I cannot map the importances to the column names directly. How can I do that for one-hot encoded data? How can I find the names of these 389 columns?
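
The width mismatch is just the one-hot expansion: every categorical column is replaced by one column per category. A quick sanity check (a minimal sketch, assuming the unique_list and numerical_columns defined above):

n_onehot = sum(len(cats) for cats in unique_list)   # columns produced by the encoder
print(len(numerical_columns) + n_onehot)            # should equal 389, the length of feature_importances_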


Solution

  • The get_feature_names method is going to be of great help here. At the moment, StandardScaler doesn't support it; since xgboost is completely unaffected by feature scaling, I would suggest dropping the scaler and replacing the numerical portion of the ColumnTransformer with "passthrough". Then rs_clf.best_estimator_.named_steps["preprocessing"].get_feature_names() should give the features in the order they arrive at the XGB step; see the sketch below.
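
A minimal sketch of the suggested change, assuming the objects from the question (unique_list, numerical_columns, categorical_columns, model, param_grid, X, y) are in scope and a scikit-learn version whose ColumnTransformer.get_feature_names handles "passthrough" (0.23+):

import pandas as pd

col_transformers = ColumnTransformer([
                          ("num_passthrough", "passthrough", numerical_columns),  # xgboost needs no scaling
                          ("ohe_onestep", OneHotEncoder(categories=unique_list, sparse=False), categorical_columns)])

pipeline = Pipeline([("preprocessing", col_transformers), ("XGB", model)])

rs_clf = RandomizedSearchCV(pipeline, param_grid, n_iter=3,
                            n_jobs=-1, verbose=2, cv=2, scoring="accuracy", refit=True, random_state=42)
rs_clf.fit(X, y)

# Feature names, in the order the boosters see them
feature_names = rs_clf.best_estimator_.named_steps["preprocessing"].get_feature_names()

# Pair names with importances, e.g. for the first label
importances = rs_clf.best_estimator_.named_steps["XGB"].estimators_[0].feature_importances_
print(pd.Series(importances, index=feature_names).sort_values(ascending=False).head(10))

Note that, depending on the scikit-learn version, the encoded names come back prefixed with the transformer name (e.g. ohe_onestep__x0_<category>); newer releases provide get_feature_names_out, which propagates the original column names instead.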