Tags: python, machine-learning, scikit-learn, cross-validation, feature-extraction

Selecting a Specific Number of Features via Sklearn's RFECV (Recursive Feature Elimination with Cross-validation)


I'm wondering if it is possible for Sklearn's RFECV to select a fixed number of the most important features. For example, working on a dataset with 617 features, I have been trying to use RFECV to see which 5 of those features are the most significant. However, RFECV does not have the parameter 'n_features_to_select', unlike RFE (which confuses me). How should I deal with this?


Solution

  • According to this Quora post:

    The RFECV object helps to tune or find this n_features parameter using cross-validation. At every step, in which "step" number of features are eliminated, it calculates the score on the validation data. The number of features left at the step that gives the maximum score on the validation data is considered to be "the best n_features" of your data.

    In other words, RFECV determines the optimal number of features (n_features) that yields the best cross-validation score.
    The fitted RFECV object exposes a ranking_ attribute with the rank of each feature, and a support_ boolean mask that selects the optimal features it found.
    However, if you MUST select the top n features from RFECV, you can use the ranking_ attribute, as in the snippet below (a fuller end-to-end sketch follows it):

    optimal_features = X[:, selector.support_]  # selector is a fitted RFECV object

    n = 5  # to select the top 5 features, as asked in the question
    feature_ranks = selector.ranking_  # rank 1 = selected; larger ranks were eliminated earlier
    feature_ranks_with_idx = enumerate(feature_ranks)
    sorted_ranks_with_idx = sorted(feature_ranks_with_idx, key=lambda x: x[1])
    top_n_idx = [idx for idx, rnk in sorted_ranks_with_idx[:n]]

    top_n_features = X[:, top_n_idx]  # all rows, top-n feature columns
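
    For completeness, here is a minimal end-to-end sketch of the above. The dataset and estimator (make_classification and LogisticRegression) are illustrative assumptions, not part of the original answer. Note also that every feature RFECV keeps receives rank 1, so cutting to a top-n smaller than the optimal count breaks ties among rank-1 features arbitrarily:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import RFECV
    from sklearn.linear_model import LogisticRegression

    # Illustrative data: 100 samples, 20 features, 5 of them informative
    X, y = make_classification(n_samples=100, n_features=20,
                               n_informative=5, random_state=0)

    # Eliminate one feature per step, scoring each subset with 5-fold CV
    selector = RFECV(LogisticRegression(max_iter=1000), step=1, cv=5)
    selector.fit(X, y)

    print(selector.n_features_)  # the optimal number of features RFECV found
    optimal_features = X[:, selector.support_]

    # Top-n via ranking_ (ties among rank-1 features are broken arbitrarily)
    n = 5
    top_n_idx = np.argsort(selector.ranking_)[:n]
    top_n_features = X[:, top_n_idx]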
    

    References: sklearn documentation, Quora post