Tags: python, machine-learning, scikit-learn, classification, cross-validation

How to perform feature selection (rfecv) in cross validation in sklearn


I want to perform recursive feature elimination with cross validation (RFECV) inside 10-fold cross validation (i.e. with cross_val_predict or cross_validate) in sklearn.

Since RFECV already performs cross validation internally, I am not clear how to combine the two. My current code is as follows.

from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

clf = RandomForestClassifier(random_state=0, class_weight="balanced")

k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

rfecv = RFECV(estimator=clf, step=1, cv=k_fold)

Please let me know how I can use the data X and y with rfecv in 10-fold cross validation.

I am happy to provide more details if needed.


Solution

  • To use recursive feature elimination in conjunction with a pre-defined k_fold, you should use RFE, not RFECV: RFECV runs its own internal cross validation to choose the number of features, so wrapping it in another CV loop would amount to nested cross validation. With RFE you fix the number of features to keep and drive the outer 10-fold split yourself:

    from sklearn.feature_selection import RFE
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import StratifiedKFold
    from sklearn.metrics import accuracy_score
    from sklearn import datasets
    
    iris = datasets.load_iris()
    X = iris.data
    y = iris.target
    
    k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    clf = RandomForestClassifier(random_state=0, class_weight="balanced")
    selector = RFE(clf, n_features_to_select=5, step=1)  # keep 5 features, eliminating 1 per iteration
    
    cv_acc = []
    
    for train_index, val_index in k_fold.split(X, y):
        # fit the selector (feature elimination + final model) on the training fold
        selector.fit(X[train_index], y[train_index])
        # evaluate on the held-out fold
        pred = selector.predict(X[val_index])
        acc = accuracy_score(y[val_index], pred)
        cv_acc.append(acc)
    
    cv_acc
    # result:
    [1.0,
     0.9333333333333333,
     0.9333333333333333,
     1.0,
     0.9333333333333333,
     0.9333333333333333,
     0.8666666666666667,
     1.0,
     0.8666666666666667,
     0.9333333333333333]
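
  • For completeness, since the question mentions cross_val_predict / cross_validate: RFE implements fit and predict (prediction is delegated to the estimator refitted on the selected features), so the selector can be passed straight to the model-selection helpers. A minimal sketch of the same 10-fold evaluation with cross_val_score; this is equivalent to the explicit loop above, not a different method:

    from sklearn.model_selection import cross_val_score

    # same selector and k_fold as above; scoring="accuracy" matches accuracy_score
    scores = cross_val_score(selector, X, y, cv=k_fold, scoring="accuracy")
    scores.mean()

    After fitting on a given fold (or on the full data), selector.support_ and selector.ranking_ show which features were kept.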