As mentioned in the title, I'm using SelectFromModel from sklearn to select features for both my random forest and gradient boosting classification models.
# feature selection performed on the training set only, to avoid leakage into the test set
sel = SelectFromModel(GradientBoostingClassifier(n_estimators=10, learning_rate=0.25,
                                                 max_depth=1, max_features=15,
                                                 random_state=0))
sel.fit(X_train_bin, y_train)  # SelectFromModel clones and fits the estimator here
# returns a boolean mask: True for features whose importance is above the
# default threshold (the mean of the model's feature importances)
sel.get_support()
# names of the selected features
selected_feat = X_train_bin.columns[sel.get_support()]
selected_feat
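To see the disagreement directly, here is a minimal, self-contained sketch (synthetic data standing in for your `X_train_bin`/`y_train`, and illustrative hyperparameters) that fits `SelectFromModel` with both estimators on the same data and compares the two boolean masks:

```python
# Sketch: compare which features each selector keeps on identical data.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the real training set (25 features, as in the question).
X, y = make_classification(n_samples=500, n_features=25, n_informative=5,
                           random_state=0)

sel_gb = SelectFromModel(GradientBoostingClassifier(n_estimators=10, random_state=0))
sel_rf = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))

mask_gb = sel_gb.fit(X, y).get_support()
mask_rf = sel_rf.fit(X, y).get_support()

# Each selector thresholds on the mean of its *own* model's
# feature_importances_, so the two masks can differ substantially.
print("GB keeps:", mask_gb.sum(), "RF keeps:", mask_rf.sum())
print("features where the masks agree:", (mask_gb == mask_rf).sum(), "of 25")
```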
The boolean arrays returned for the random forest and gradient boosting models are COMPLETELY different. Random forest feature selection tells me to drop an additional 4 columns (out of 25 features), while feature selection on the gradient boosting model tells me to drop nearly everything. What is happening here?
EDIT: I'm trying to compare the performance of these 2 models on my dataset. Should I move the threshold so that both are trained on approximately the same number of features?
There's no reason for them to select the same variables. GradientBoostingClassifier builds each tree to correct the errors of the previous ones, while RandomForestClassifier trains independent trees that never see each other's errors.
Another reason they may select different features is the split criterion: by default, RandomForestClassifier splits on Gini impurity, while GradientBoostingClassifier uses Friedman MSE. Finally, both can consider only a random subset of features at each split (random forests do this by default, and you set max_features=15 on the gradient boosting model), so the two models did not even evaluate the same candidate variables in the same order, which naturally yields different importances.
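On the EDIT: rather than nudging the default "mean" threshold until the counts happen to match, you can ask `SelectFromModel` for a fixed number of features directly via its `max_features` argument, passing `threshold=-np.inf` to disable the importance cutoff. A sketch under the same synthetic-data assumption (the budget `k = 10` is arbitrary, chosen for illustration):

```python
# Sketch: cap both selectors at the same feature budget so the two models
# are compared on equally sized feature sets.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=500, n_features=25, n_informative=5,
                           random_state=0)

k = 10  # same budget for both models (hypothetical choice)
# threshold=-np.inf disables the importance cutoff, so exactly the
# top-k features by importance are kept.
sel_gb = SelectFromModel(GradientBoostingClassifier(random_state=0),
                         threshold=-np.inf, max_features=k).fit(X, y)
sel_rf = SelectFromModel(RandomForestClassifier(random_state=0),
                         threshold=-np.inf, max_features=k).fit(X, y)

print(sel_gb.get_support().sum(), sel_rf.get_support().sum())  # 10 10
```

The two models may still keep *different* top-10 features, but each is now trained on the same number of them, which makes the performance comparison fairer.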