Tags: python, pandas, feature-selection, rfe

Performing RFECV in Python and understanding the output


I am doing RFECV in Python using pandas. My step size is 1. I start with 174 features. My function call is as below:

rfecv = RFECV(estimator=LogisticRegression(solver='lbfgs'), step=1,
              cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=44),
              scoring='recall', min_features_to_select=30, verbose=0)
rfecv.fit(X_train, y['tag'])

The optimal number of features returned by RFECV is 89. I noticed that the length of cv_results_['mean_test_score'] is 145.

Shouldn't it be 174 - 89 = 85? If RFECV removes 1 feature at a time and ends up with 89 features out of 174, then I expected there to be 85 steps (the length of 'mean_test_score').

Adding a dummy example:

In the case below, we start with 150 features. The minimum number of features to select is 3 and it selects 4 features. But then why is print(len(selector.cv_results_['std_test_score'])) equal to 148 if 1 feature is eliminated at a time?

from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.svm import SVR
X, y = make_friedman1(n_samples=50, n_features=150, random_state=0)
estimator = SVR(kernel="linear")
selector = RFECV(estimator, step=1, cv=5, min_features_to_select=3)
selector = selector.fit(X, y)
print(selector.support_)
print(selector.ranking_)
print(selector.n_features_)

print(len(selector.cv_results_['std_test_score']))

Solution

  • In your first example, you start with 174 features and RFECV reports 89 as the optimal number. The length of cv_results_['mean_test_score'] is not the number of features removed before stopping, because RFECV does not stop as soon as it finds a good subset. It keeps eliminating features, step at a time, all the way down to min_features_to_select, cross-validating the model at every candidate feature count along the way, and only afterwards picks the count with the best mean score.

    With step=1, a score is therefore recorded for every feature count from min_features_to_select up to the original number of features, so the length of cv_results_['mean_test_score'] is (174 - 30) / 1 + 1 = 145. Your 10-fold stratified cross-validation determines how each of those 145 scores is computed (the mean recall over the folds), not how many of them there are. The optimal number, 89, is simply the candidate count among those 145 that achieved the best mean recall; it has no influence on the length of the array, which is why 174 - 89 = 85 is not the number you see.

    In your second example, the same arithmetic applies: 150 features with min_features_to_select=3 and step=1 give 150 - 3 + 1 = 148 entries in cv_results_['std_test_score'], even though only 4 features end up being selected. A quick check of this arithmetic is shown in the sketch below.
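
    A minimal sketch of that calculation (the helper name expected_n_scores is mine, not part of scikit-learn; it assumes the elimination runs all the way down to min_features_to_select, which is what RFECV does):

    import math

    def expected_n_scores(n_features, min_features_to_select=1, step=1):
        # One cross-validated score is recorded per candidate feature count:
        # n_features, n_features - step, ..., down to min_features_to_select.
        return math.ceil((n_features - min_features_to_select) / step) + 1

    print(expected_n_scores(174, min_features_to_select=30))  # 145, the length of cv_results_['mean_test_score']
    print(expected_n_scores(150, min_features_to_select=3))   # 148, the length of cv_results_['std_test_score']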

    I created a simple example with 10 features to demonstrate the RFECV process:

    import numpy as np
    import pandas as pd
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.feature_selection import RFECV
    from sklearn.model_selection import StratifiedKFold
    
    # Generate synthetic data with 10 features and 100 samples
    X, y = make_classification(n_samples=100, n_features=10, random_state=42)
    
    # Define the estimator and RFECV parameters
    estimator = LogisticRegression(solver='lbfgs')
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=44)
    step_size = 1
    min_features_to_select = 3
    
    # Create the RFECV object and fit it to the data
    rfecv = RFECV(estimator=estimator, step=step_size, cv=cv, scoring='accuracy', 
                  min_features_to_select=min_features_to_select, verbose=0)
    rfecv.fit(X, y)
    
    # Get the optimal number of features selected
    optimal_num_features = rfecv.n_features_
    
    # Get the mean test scores during the feature selection process
    mean_test_scores = rfecv.cv_results_['mean_test_score']
    
    # Print the results
    print("Optimal number of features selected:", optimal_num_features)
    print("Number of steps in RFECV:", len(mean_test_scores))
    

    The output:

    [screenshot showing the optimal number of features selected and the number of steps in RFECV]

    Since min_features_to_select=3, the RFECV process will select at least 3 features, but it still scores every candidate feature count on the way down. With 10 features, step=1 and min_features_to_select=3, the number of steps printed is 10 - 3 + 1 = 8; how many features end up being optimal does not change that count.
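
    Because n_features_ is just the candidate count with the best mean test score, you can recover it from cv_results_ yourself. A hedged follow-up sketch, continuing from the fitted rfecv above and assuming step=1 (so that entry i of the cv_results_ arrays corresponds to min_features_to_select + i features):

    import numpy as np

    # With step=1, index i of cv_results_ corresponds to (min_features_to_select + i) features.
    best_index = int(np.argmax(rfecv.cv_results_['mean_test_score']))
    print(min_features_to_select + best_index)  # expected to match rfecv.n_features_
    print(rfecv.n_features_)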