My dataset has over 200 variables and I am running a classification model on it, which is leading to a model OverFit. Which suggested for reducing the number of features? I started with Feature Importance, however due to such a large number of variables, I am unable to visualise it. Is there a way I can plot or showcase these values with respect to the given variable?
Below is the code that am trying:
F_Select = ExtraTreesClassifier(n_estimators=50)
F_Select.fit(X_train,y_train)
print(F_Select.feature_importances_)
You could try plotting the feature importances from largest to smallest and seeing which features capture a certain amount (say 95%) of the variance, like a scree plot used in PCA. Ideally, this should be a small number of features:
import matplotlib.pyplot as plt
from sklearn import *
model = ensemble.RandomForestClassifier()
model.fit(features, labels)
model.feature_importances_
importances = np.sort(model.feature_importances_)[::-1]
cumsum = np.cumsum(importances)
plt.bar(range(len(importances)), importances)
plt.plot(cumsum)