python machine-learning scikit-learn feature-selection

Why does plot of RFECV grid scores split into 5 lines?

I'm trying to plot the CV score for recursive feature selection for a wine quality data set. I can't work out why my plot ends up splitting out into 5 lines... My expectation is that it will be 1 line showing the change in CV score for a number of selected features.

rfe.grid_scores_ comes out as a list of lists... I've no idea why its doing this. All examples I have seen plotting grid_scores like this show the plot as one line without any indexing of the list...


dataset_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
df = pd.read_csv(dataset_url)

from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler

# define dataset
x = df.iloc[:, 0:-1]
y = df.quality

# create pipeline
rfe = RFECV(estimator=DecisionTreeClassifier())
model = RandomForestRegressor()
pipeline = Pipeline(steps=[('s',rfe),('m',model)])
pipeline.fit(x,y)
# evaluate model
cv = RepeatedKFold(n_splits=10)
n_scores = cross_val_score(pipeline, x, y, cv=cv, n_jobs=1, error_score='raise', scoring = 'neg_mean_squared_error')
    
plt.figure(figsize=(12,6))
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score")
plt.plot(range(1, len(rfe.grid_scores_) + 1), rfe.grid_scores_)
plt.show()

Solution

You're plotting five lines because rfe.grid_scores_ returns an array of (number of features, scores per fold) or (11,5). By default, RFECV uses a stratified, 5-fold split. Each line in your plot shows the scores across the number of features for each of the 5 folds.

If you want to use the specified folds and method, you'll need to pass those to the RFECV.

dataset_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
df = pd.read_csv(dataset_url, sep = ';')

# define dataset
x = df.iloc[:, 0:-1]
y = df.quality

### Define the cross-validation strategy here so that it is the same in all parts of your pipeline
# create pipeline
cv = RepeatedKFold(n_splits=10) # 10-folds specified by you, 10 repeats (default)

rfe = RFECV(estimator=DecisionTreeClassifier(), cv=cv)

The resulting plot will now plot 100 lines over 11 features, (10 folds times 10 repeats).

Notice also that these are the grid scores, not your cross-validation scores. You'll need to plot n_scoresto view the actual cross-validation scores.