
How is scikit-learn's RFECV `cv_results_` attribute ordered?


I fit an RFECV instance on my training data using a binary classifier clf.

My training data has 154 features. I used 10-fold cross-validation and dropped five features per iteration.

from sklearn.feature_selection import RFECV

# clf is the binary classifier mentioned above
rfecv = RFECV(
    estimator=clf,
    step=5,
    min_features_to_select=10,
    cv=10,
    scoring='precision',
    verbose=10,
    n_jobs=1,
    importance_getter='auto'
)

I do not understand how the resulting rfecv.cv_results_ dictionary is ordered. After turning it into a pandas DataFrame, I noticed that the number of rows corresponds to the number of feature counts tested (i.e., 154, 149, 144, ..., 24, 19, 14).

However, I'd like to know which row number corresponds to which number of features. For example, does the first row represent the 10 models that used 154 features?

My mean test scores (rfecv.cv_results_.get('mean_test_score')), in their current order, look as follows:

[Figure: RFECV mean test scores]

It seems to me that the results are ranked in ascending order (first the 10 models with 14 features, then 19, then 24, etc.). However, this seems counter-intuitive to me because the elimination process is recursive, which is why I'm asking for help.


Solution

  • You are correct that feature elimination starts with all available features and removes them step by step.

    However, the results are indeed sorted by ascending number of features: the scores are collected in elimination order and then reversed before they are stored. The relevant part of the source code (line 783 ff.) shows this:

    # reverse to stay consistent with before
    scores_rev = scores[:, ::-1]
    self.cv_results_ = {}
    self.cv_results_["mean_test_score"] = np.mean(scores_rev, axis=0)
    self.cv_results_["std_test_score"] = np.std(scores_rev, axis=0)
    
    for i in range(scores.shape[0]):
        self.cv_results_[f"split{i}_test_score"] = scores_rev[i]
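
To see what that reversal does, here is a minimal sketch with made-up numbers (the `scores` array below is hypothetical, not taken from a real fit). It assumes, as in the snippet above, that rows are CV splits and columns are elimination steps, with the first column belonging to the full feature set:

```python
import numpy as np

# Hypothetical per-split scores: 2 splits (rows) x 3 elimination steps (columns).
# Columns are in elimination order, e.g. 154, 149, 144 features.
scores = np.array([
    [0.90, 0.80, 0.70],
    [0.92, 0.82, 0.72],
])

# Reverse the column order, as the quoted source does.
scores_rev = scores[:, ::-1]

# After the reversal, position 0 holds the step with the FEWEST features.
mean_test_score = np.mean(scores_rev, axis=0)
std_test_score = np.std(scores_rev, axis=0)
```

So the first entry of `mean_test_score` is the average over all splits for the smallest feature subset, and the last entry is the average for the full feature set.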
    

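    To make your DataFrame more understandable, you can add the number of features as a column. Because the rows are in ascending order, row 0 belongs to the smallest feature set and the last row to all 154 features, so the sequence has to be reversed (this assumes numpy is imported as np):

    # Build the descending sequence (154, 149, ..., never below 10), then reverse it.
    cv_results_df['num_features'] = np.maximum(154 - 5 * np.arange(len(cv_results_df)), 10)[::-1]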
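
If you would rather not hard-code the numbers, the mapping from row to feature count can be sketched from the RFECV parameters. The helper name `feature_counts_ascending` is made up for this illustration. It assumes an integer `step` and that each iteration removes `step` features without dropping below `min_features_to_select`, which is my reading of how RFE eliminates features:

```python
def feature_counts_ascending(n_features, step, min_features_to_select):
    """Feature counts visited by the elimination, smallest first.

    Assumes an integer `step`; a fractional step (a share of the
    features) is not handled in this sketch.
    """
    counts = [n_features]
    # Remove `step` features per iteration, but never go below the minimum.
    while counts[-1] > min_features_to_select:
        counts.append(max(counts[-1] - step, min_features_to_select))
    # Elimination runs from many features to few; cv_results_ is stored reversed.
    return counts[::-1]


counts = feature_counts_ascending(n_features=154, step=5, min_features_to_select=10)
```

Before assigning `counts` as a column, check that `len(counts)` equals the number of rows in your DataFrame. Newer scikit-learn releases (1.5 and later, as far as I know) also store an `n_features` entry in `cv_results_`; if your version has it, prefer that over reconstructing the counts by hand.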