The scikit-learn library supports recursive feature elimination (RFE) and its cross-validated version (RFECV). RFECV is very useful to me because it selects a small feature subset, but I wonder how the cross-validation of RFE is actually done.
RFE is a way of eliminating the least important features one step at a time, so I assumed RFECV would compute a cross-validation score while removing features one by one.
But if cross-validation is used, each fold may pick different features as the least important ones, because its data is different. Does anyone know how features are removed in RFECV?
The cross-validation is done over the number of features: each CV iteration runs a full recursive elimination on its training fold and adds the test-fold score for every number of removed features. It then picks a number n_features_to_select of features to keep, based on the accumulated scores, and runs RFE on the complete dataset, keeping only n_features_to_select features.
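In other words, something like the following sketch. This is my own illustration using the public RFE API, with LogisticRegression as an assumed stand-in base estimator; the real implementation (quoted below) is more efficient, since it scores every feature count in a single elimination pass instead of refitting for each count:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
candidate_counts = list(range(1, X.shape[1] + 1))

# One accumulated CV score per candidate number of features to keep.
scores = np.zeros(len(candidate_counts))
for train, test in KFold(n_splits=5).split(X):
    for i, n in enumerate(candidate_counts):
        # Eliminate features on the training fold only ...
        rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=n)
        rfe.fit(X[train], y[train])
        # ... and score the reduced model on the held-out fold.
        scores[i] += rfe.score(X[test], y[test])

# Keep the count with the best summed score, then rerun RFE on all the data.
best_n = candidate_counts[int(np.argmax(scores))]
final_rfe = RFE(LogisticRegression(max_iter=1000),
                n_features_to_select=best_n).fit(X, y)
print(best_n, final_rfe.support_)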
From the source:
for n, (train, test) in enumerate(cv):
    X_train, y_train = _safe_split(self.estimator, X, y, train)
    X_test, y_test = _safe_split(self.estimator, X, y, test, train)

    rfe = RFE(estimator=self.estimator,
              n_features_to_select=n_features_to_select,
              step=self.step, estimator_params=self.estimator_params,
              verbose=self.verbose - 1)

    rfe._fit(X_train, y_train, lambda estimator, features:
             _score(estimator, X_test[:, features], y_test, scorer))
    scores.append(np.array(rfe.scores_[::-1]).reshape(1, -1))

scores = np.sum(np.concatenate(scores, 0), 0)
# The index in 'scores' when 'n_features' features are selected
n_feature_index = np.ceil((n_features - n_features_to_select) /
                          float(self.step))
n_features_to_select = max(n_features_to_select,
                           n_features - ((n_feature_index -
                                          np.argmax(scores)) *
                                         self.step))
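# Illustrative arithmetic (numbers assumed purely for the example): with
# n_features = 10, n_features_to_select = 1 and step = 1, n_feature_index
# is ceil(9 / 1) = 9; if np.argmax(scores) == 4, the expression above
# keeps max(1, 10 - (9 - 4) * 1) = 5 features for the final elimination.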
# Re-execute an elimination with best_k over the whole set
rfe = RFE(estimator=self.estimator,
          n_features_to_select=n_features_to_select,
          step=self.step, estimator_params=self.estimator_params)
rfe.fit(X, y)
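If you only need the behaviour rather than the internals, note that the source quoted above is from an older scikit-learn release (estimator_params has since been removed); with a recent version the equivalent usage is a sketch like this (SVC with a linear kernel is just an assumed example estimator):

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

selector = RFECV(SVC(kernel="linear"), step=1, cv=5)
selector.fit(X, y)

print(selector.n_features_)  # the number of features picked by the CV step
print(selector.support_)     # mask of the features kept by the final RFE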