I am fairly new to programming and this problem might be easy to fix, but I have been stuck on it for a while and I think my approach is just plain wrong. As the title indicates, I have been trying to run a grid search over my RandomForest model to find the best possible parameters, and then look at the most important features of the model with those parameters. These are the packages I've used:
import nltk
from nltk.corpus import stopwords
import pandas as pd
import string
import re
import pickle
import os
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
After some data cleaning and preprocessing, I set up a grid search like this, where x_features is the DataFrame with the TF-IDF-vectorized features of my data:
rf = RandomForestClassifier()
param = {'n_estimators': [10, 50, 150],
         'max_depth': [10, 30, 50, None],
         'min_impurity_decrease': [0, 0.01, 0.05, 0.1],
         'class_weight': ['balanced', None]}
gs = GridSearchCV(rf, param, cv=5, n_jobs=-1)
gs_fit = gs.fit(x_features, mydata['label'])
optimal_param = pd.DataFrame(gs_fit.cv_results_).sort_values('mean_test_score', ascending=False)[0:5]
optimal_param1 = gs_fit.best_params_
My idea was that maybe I could make it easy for myself and pass optimal_param1 straight into RandomForestClassifier(), then fit it on my training data, more or less like this:
rf = RandomForestClassifier(optimal_param1)
rf_model = rf.fit(x_train, y_train)
but optimal_param1 is a dict. I therefore thought that converting it to a string and stripping the characters that are in the way (replacing : with =, deleting { and }) would make it work. That obviously failed, as the values for n_estimators, max_depth, etc. were still strings while integers were expected. What I wanted to achieve in the end was an output of the most important features, more or less like this:
top25_features = sorted(zip(rf_model.feature_importances_, x_train.columns), reverse=True)[0:25]
I realize that the fitted gs already contains a complete RF model, but gs itself does not have the feature_importances_ attribute I was looking for. I would be very thankful for any ideas on how to make this work.
Once you have run gs_fit = gs.fit(X, y), you have everything you need and you don't need to do any retraining.
First, you can access the best model found by the search:
best_estimator = gs_fit.best_estimator_
This returns the random forest that yielded the best results; by default (refit=True), GridSearchCV refits it on the whole dataset, which is why no retraining is needed. You can then access this model's feature importances with:
best_features = best_estimator.feature_importances_
Obviously, you can chain these and directly do:
best_features = gs_fit.best_estimator_.feature_importances_
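Putting it together with the top-25 output you were after, here is a minimal sketch; it assumes x_features is the pandas DataFrame the grid search was fit on, so its columns line up with the importances:

# the importances are aligned with the columns of the data the model was fit on
best_estimator = gs_fit.best_estimator_
top25_features = sorted(zip(best_estimator.feature_importances_, x_features.columns), reverse=True)[0:25]

# or, for nicer printing, as a pandas Series
importances = pd.Series(best_estimator.feature_importances_, index=x_features.columns)
print(importances.sort_values(ascending=False).head(25))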
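As a side note on the string manipulation you tried: a dict can be unpacked into keyword arguments with **, so if you ever do want to refit a model with the chosen parameters yourself, there is no need to convert anything to a string:

# ** unpacks {'n_estimators': 150, ...} into keyword arguments
rf = RandomForestClassifier(**gs_fit.best_params_)
rf_model = rf.fit(x_train, y_train)

This is plain Python, not anything scikit-learn specific.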