Tags: pandas, scikit-learn, feature-extraction, feature-selection, sklearn-pandas

Why are feature selection results different for a random forest classifier when applied in two different ways?


I want to do feature selection, and I used a random forest classifier, but in two different ways.

I used sklearn.feature_selection.SelectFromModel(estimator=RandomForestClassifier(...)), and I also used the random forest classifier standalone. It was surprising to find that although I used the same classifier, the results were different: except for two features, all the selected features differed. Can someone explain why? Could it be because the parameters change between the two cases?
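Roughly, the two ways looked like this (a minimal sketch: the synthetic data and default parameters below stand in for my real dataset and settings):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Placeholder data; my real X and y come from elsewhere.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Way 1: let SelectFromModel fit the forest and pick the features.
selector = SelectFromModel(estimator=RandomForestClassifier())
selector.fit(X, y)
selected_a = np.flatnonzero(selector.get_support())

# Way 2: fit the forest standalone and rank by feature_importances_.
rf = RandomForestClassifier()
rf.fit(X, y)
selected_b = np.argsort(rf.feature_importances_)[::-1][: len(selected_a)]

print(sorted(selected_a), sorted(selected_b))  # the two sets mostly disagree
```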


Solution

  • This could be because SelectFromModel refits the estimator by default, and sklearn.ensemble.RandomForestClassifier has two parameters that introduce randomness: bootstrap, which is True by default, and max_features, which is 'auto' by default.

    If you did not set a random_state in your RandomForestClassifier estimator, it will most likely yield different results every time you fit it, even on the same training data, because of the randomness introduced by the bootstrap and max_features parameters.

    • bootstrap=True means that each tree is trained on a random sample of the training observations, drawn with replacement (by default the sample has as many draws as there are observations).

    • max_features='auto' means that at each split, only a random subset of features (of size the square root of the total number of features) is considered when picking the cutoff point that most reduces the Gini impurity.

    You can do two things to ensure you get the same results:

    1. Train your estimator first and then pass the fitted model in with SelectFromModel(fitted_rf, prefit=True), so the selector reuses your fitted forest instead of refitting a clone. (Note: the parameter is prefit, not refit.)
    2. Instantiate RandomForestClassifier with a fixed random_state and then use SelectFromModel as usual.

    Needless to say, both options require you to pass the same X and y data.
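For illustration, here is a minimal sketch of both options on placeholder data (the synthetic dataset and the seed 42 are assumptions; substitute your own X, y, and seed):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Placeholder data; substitute your own X and y.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Option 1: fit the forest yourself, then wrap it with prefit=True so
# SelectFromModel reuses the fitted model instead of refitting a clone.
rf = RandomForestClassifier()
rf.fit(X, y)
selector = SelectFromModel(rf, prefit=True)  # no selector.fit() needed
mask = selector.get_support()
print("prefit selection:    ", np.flatnonzero(mask))
print("standalone top ones: ",
      np.sort(np.argsort(rf.feature_importances_)[::-1][: mask.sum()]))

# Option 2: fix the seed so the selector's internal refit and your own
# standalone fit train identical forests on the same data.
rf_seeded = RandomForestClassifier(random_state=42)
selector_seeded = SelectFromModel(rf_seeded).fit(X, y)
rf_seeded.fit(X, y)
mask2 = selector_seeded.get_support()
print("seeded selection:    ", np.flatnonzero(mask2))
print("standalone top ones: ",
      np.sort(np.argsort(rf_seeded.feature_importances_)[::-1][: mask2.sum()]))
```

With prefit=True the selector and the standalone forest are literally the same fitted model, so the selections agree by construction; with a fixed random_state they agree because the two fits are reproducible.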