python · scikit-learn · cross-validation

Score of RFECV() in python scikit-learn


The scikit-learn library supports recursive feature elimination (RFE) and its cross-validated version (RFECV). RFECV is very useful for me because it selects a small feature subset, but I wonder how the cross-validation of RFE is actually done.

RFE is a way to remove the least important features one by one. So I thought RFECV would compute a cross-validation score while removing features one at a time.

But if cross-validation is used, I think each fold will pick different features as its least important, because the data differs between folds. Does someone know how features are removed in RFECV?


Solution

  • The cross-validation is done over the *number* of features, not over which specific features to drop. Each CV iteration runs a full RFE on its training fold and records a test score for every candidate number of removed features.

    RFECV then picks the number of features to keep, n_features_to_select, based on the summed scores across folds, and runs one final RFE on the complete dataset keeping only n_features_to_select features. So the per-fold feature rankings only influence the scores; the final support mask comes from that single RFE on all the data.

    From the source:

    for n, (train, test) in enumerate(cv):
        X_train, y_train = _safe_split(self.estimator, X, y, train)
        X_test, y_test = _safe_split(self.estimator, X, y, test, train)
    
        rfe = RFE(estimator=self.estimator,
                  n_features_to_select=n_features_to_select,
                  step=self.step, estimator_params=self.estimator_params,
                  verbose=self.verbose - 1)
    
        rfe._fit(X_train, y_train, lambda estimator, features:
                 _score(estimator, X_test[:, features], y_test, scorer))
        scores.append(np.array(rfe.scores_[::-1]).reshape(1, -1))
    scores = np.sum(np.concatenate(scores, 0), 0)
    # The index in 'scores' when 'n_features' features are selected
    n_feature_index = np.ceil((n_features - n_features_to_select) /
                              float(self.step))
    n_features_to_select = max(n_features_to_select,
                               n_features - ((n_feature_index -
                                             np.argmax(scores)) *
                                             self.step))
    # Re-execute an elimination with best_k over the whole set
    rfe = RFE(estimator=self.estimator,
              n_features_to_select=n_features_to_select,
              step=self.step, estimator_params=self.estimator_params)
    rfe.fit(X, y)
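To see the effect of that final full-data RFE run, here is a minimal usage sketch on synthetic data (the dataset and estimator choice are mine, not from the question; in recent scikit-learn versions the estimator_params argument shown in the quoted source no longer exists, and per-fold scores live in cv_results_):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Hypothetical data: 10 features, only 3 of them informative.
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           random_state=0)

# CV chooses how many features to keep; a final RFE on all of X
# decides which ones, producing a single support mask.
selector = RFECV(LogisticRegression(max_iter=1000), step=1, cv=5)
selector.fit(X, y)

print(selector.n_features_)  # number of features RFECV decided to keep
print(selector.support_)     # one boolean mask, not one per fold
```

Note that selector.support_ is a single mask over all 10 columns: even though each fold may have ranked features differently during scoring, the mask you get comes from the one RFE re-run on the whole dataset.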