Tags: python, machine-learning, classification, feature-selection

Feature selection methodology to reduce overfitting in a classification model


My dataset has over 200 variables and I am running a classification model on it, which is leading to overfitting. What is suggested for reducing the number of features? I started with feature importance, but with such a large number of variables I am unable to visualise it. Is there a way I can plot or showcase these values with respect to each variable?

Below is the code that I am trying:

from sklearn.ensemble import ExtraTreesClassifier

# Fit an extra-trees ensemble and inspect the impurity-based feature importances
F_Select = ExtraTreesClassifier(n_estimators=50)
F_Select.fit(X_train, y_train)
print(F_Select.feature_importances_)

Solution

  • You could try plotting the feature importances from largest to smallest and seeing how many features are needed to capture a certain share (say 95%) of the total importance, much like a scree plot used in PCA. Ideally, this should be a small number of features; a sketch of selecting that subset follows the plotting code below:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.ensemble import RandomForestClassifier

    model = RandomForestClassifier()
    model.fit(X_train, y_train)

    # Sort importances from largest to smallest and compute their running total
    importances = np.sort(model.feature_importances_)[::-1]
    cumsum = np.cumsum(importances)

    # Bar chart of the sorted importances with the cumulative curve overlaid
    plt.bar(range(len(importances)), importances)
    plt.plot(cumsum)
    plt.show()
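
    To go one step further and actually keep only the features up to that cutoff, a minimal sketch is shown below. It assumes X_train is a pandas DataFrame (so column names are available) and uses an illustrative 95% cumulative-importance threshold; both the threshold and the choice of RandomForestClassifier are assumptions, not part of the original answer:

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    model = RandomForestClassifier()
    model.fit(X_train, y_train)

    # Pair each importance with its column name and sort from largest to smallest
    imp = pd.Series(model.feature_importances_, index=X_train.columns)
    imp = imp.sort_values(ascending=False)

    # Keep the smallest set of features whose importances sum to at least 95%
    # NOTE: the 0.95 cutoff is an illustrative choice, not a rule
    n_keep = int(np.searchsorted(imp.cumsum().values, 0.95)) + 1
    selected = imp.index[:n_keep]
    print(f"Keeping {n_keep} of {len(imp)} features:")
    print(selected.tolist())

    # Restrict the training data to the selected columns
    X_train_reduced = X_train[selected]

    If you prefer a built-in utility, sklearn's SelectFromModel can perform a similar importance-threshold-based selection on a fitted estimator.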