I want to know the names of the features within my RF model. I read here that the output from gs.best_estimator_.named_steps["stepname"].feature_importances_ would mirror the columns of my data. However, the length of gs.best_estimator_.... is 10, while I have 13 columns. Some columns were not important. From other answers (answer1, answer2), it seems I would have to declare something within my pipeline, but I am confused as to what to declare, because both answers deal with PCA, not RF.
Here is what I have so far.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn import datasets
# use iris as example
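# load_iris() returns a Bunch, not a DataFrame, so build one with the column names used below
bunch = datasets.load_iris(as_frame=True)
iris = bunch.data.copy()
iris.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
iris['species'] = bunch.target_names[bunch.target.to_numpy()]
X = iris.drop(['sepal_length'], axis=1)
y = iris.sepal_length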
cat_feats = ['species']
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=13)
# Pipeline
categorical_transformer = Pipeline(steps=[
    # note: `sparse` was renamed to `sparse_output` in scikit-learn 1.2
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse=False))
])
# Bundle any preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, cat_feats)
    ])
rf = RandomForestRegressor(random_state = 13)
mymodel = Pipeline(steps=[('preprocessor', preprocessor),
                          ('model', rf)
                          ])
# For this example, I used default values. In reality I do use a dictionary of parameters
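# param_grid is a required argument; this single-value grid is only a placeholder
gs = GridSearchCV(mymodel,
                  param_grid={'model__n_estimators': [100]},
                  n_jobs=-1,
                  cv=5)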
gs.fit(X_train,y_train)
The length of your features does not match because all non-categorical columns are discarded by your ColumnTransformer. By default, it only keeps the columns for which a transformation was specified. If you do not want this to happen, you need to set remainder='passthrough':
preprocessor = ColumnTransformer(transformers=[('cat', OneHotEncoder(), cat_feats)],
                                 remainder='passthrough')
(I removed your categorical pipeline, which is not necessary here)
Also keep in mind that applying the OHE will add features and so the total number of features is going to be larger than what you had in the beginning.
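To make the two points above concrete, here is a minimal sketch on a hypothetical toy dataframe (the data and the variable names `dropped` and `kept` are made up for illustration). It compares the number of output columns with the default remainder='drop' against remainder='passthrough':

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame: one categorical column and two numerical ones
df = pd.DataFrame({
    "species": ["setosa", "virginica", "setosa", "versicolor"],
    "sepal_width": [3.5, 3.0, 3.2, 2.8],
    "petal_length": [1.4, 5.1, 1.3, 4.5],
})

# Default remainder='drop': only the one-hot encoded columns survive
dropped = ColumnTransformer([("cat", OneHotEncoder(), ["species"])])

# remainder='passthrough': untouched columns are appended after the one-hot columns
kept = ColumnTransformer(
    [("cat", OneHotEncoder(), ["species"])],
    remainder="passthrough",
)

n_dropped = dropped.fit_transform(df).shape[1]  # one column per species category
n_kept = kept.fit_transform(df).shape[1]        # one-hot columns plus the numerical ones
print(n_dropped, n_kept)
```

With three distinct species, the first transformer produces three columns and the second produces five, which is why the length of feature_importances_ can differ from the number of columns in the original dataframe in either direction.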
Once you have fitted everything, you need to retrieve the feature names for the result of the OHE and the remaining numerical columns.
For the OHE columns:
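# On scikit-learn >= 1.0, use get_feature_names_out() instead; get_feature_names() was removed in 1.2
cat_features = gs.best_estimator_["preprocessor"].named_transformers_["cat"].get_feature_names()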
For the numerical columns, you need to declare num_feats, a list of all numerical features in the same order as in your original dataframe.
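One way to build that list without typing it by hand is to filter the dataframe's columns, which keeps the original order. This is only a sketch: the small `X_train` below is a hypothetical stand-in for your real training dataframe.

```python
import pandas as pd

# Hypothetical stand-in for the real training dataframe
X_train = pd.DataFrame({
    "sepal_width": [3.5, 3.0],
    "species": ["setosa", "virginica"],
    "petal_length": [1.4, 5.1],
    "petal_width": [0.2, 1.8],
})
cat_feats = ["species"]

# Every column that is not categorical, in the dataframe's own order
num_feats = [col for col in X_train.columns if col not in cat_feats]
print(num_feats)
```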
Then just do:
import numpy as np
feature_names = np.concatenate((cat_features, num_feats))
PS: this is a bit cumbersome and might be improved in later sklearn versions, but as of now, this is the procedure.