I want to do feature selection with a random forest classifier, and I tried it in two different ways. First I used `sklearn.feature_selection.SelectFromModel(estimator=RandomForestClassifier(...))`, and then I used the random forest classifier standalone. It was surprising to find that although I used the same classifier, the results were different: except for two features, all the selected features differed. Can someone explain why this is? Could it be because the parameters change between the two cases?
This could be because `SelectFromModel` refits the estimator by default, and `sklearn.ensemble.RandomForestClassifier` has two pseudo-random parameters: `bootstrap`, which is set to `True` by default, and `max_features`, which is set to `'auto'` by default.

If you did not set a `random_state` in your `RandomForestClassifier` estimator, it will most likely yield different results every time you fit the model, even on the same training data, because of the randomness introduced by the `bootstrap` and `max_features` parameters.
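As a quick sanity check, here is a minimal sketch (using a synthetic dataset from `make_classification`, chosen only for illustration) showing that two unseeded forests generally disagree on feature importances, while a fixed `random_state` makes them reproducible:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data just for illustration.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Two fits without a random_state: bootstrap sampling and feature
# subsampling use different random draws each time, so the learned
# feature importances generally differ between the two models.
imp_a = RandomForestClassifier().fit(X, y).feature_importances_
imp_b = RandomForestClassifier().fit(X, y).feature_importances_
print("identical without seed:", np.allclose(imp_a, imp_b))  # usually False

# With a fixed random_state the randomness is reproducible.
imp_c = RandomForestClassifier(random_state=42).fit(X, y).feature_importances_
imp_d = RandomForestClassifier(random_state=42).fit(X, y).feature_importances_
print("identical with seed:", np.allclose(imp_c, imp_d))
```

Since feature selection here is driven by `feature_importances_`, any run-to-run drift in those importances changes which features clear the selection threshold.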
`bootstrap=True` means that each tree is trained on a random sample, drawn with replacement, of the observations in the training dataset (by default a sample of the same size as the original). `max_features='auto'` means that when building each node, only a random subset of features, of size equal to the square root of the number of features in your training data, is considered to pick the cutoff point that reduces the Gini impurity most.
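The two sources of randomness above can be sketched directly with NumPy (the sizes here are arbitrary, and the generator is deliberately left unseeded to mirror an unseeded forest):

```python
import numpy as np

rng = np.random.default_rng()  # unseeded on purpose: different every run
n_samples, n_features = 200, 16

# bootstrap=True: each tree sees row indices drawn with replacement, so
# some rows repeat and others are left out of that tree entirely
# (on average roughly 63% of the rows appear at least once).
boot_idx = rng.choice(n_samples, size=n_samples, replace=True)
print("unique rows in bootstrap sample:", len(np.unique(boot_idx)))

# max_features='auto' (sqrt): at each split only a random subset of
# sqrt(n_features) candidate features is evaluated for the best cutoff.
k = int(np.sqrt(n_features))  # sqrt(16) = 4 candidates per split
split_candidates = rng.choice(n_features, size=k, replace=False)
print("features considered at this split:", sorted(split_candidates))
```

Because both draws change on every unseeded fit, the trees, and hence the aggregated importances, change too.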
You can do two things to ensure you get the same results:

1. Fit your `RandomForestClassifier` yourself and pass the fitted estimator to `SelectFromModel` with `prefit=True`, so the selector does not refit it.
2. Initialize your `RandomForestClassifier` with a random seed (`random_state`) and then use `SelectFromModel` as usual.

Needless to say, both options require you to pass the same `X` and `y` data.
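A minimal sketch of both options (again on a synthetic `make_classification` dataset, used only so the example is self-contained), showing that they select the same features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Option 1: fit the forest once yourself, then hand the fitted model to
# SelectFromModel with prefit=True so it is not refit internally.
rf = RandomForestClassifier(random_state=42).fit(X, y)
mask_prefit = SelectFromModel(rf, prefit=True).get_support()

# Option 2: give the forest a fixed random_state and let SelectFromModel
# fit it; the same seed on the same data grows the same trees.
selector = SelectFromModel(RandomForestClassifier(random_state=42)).fit(X, y)
mask_seeded = selector.get_support()

# Both routes now agree on which features are kept.
print((mask_prefit == mask_seeded).all())  # True
```

Either way, the selected-feature mask is now stable across runs, so the `SelectFromModel` route and a standalone forest with the same seed will no longer disagree.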