Search code examples
pythonmachine-learningscikit-learnfeature-selection

Why does plot of RFECV grid scores split into 5 lines?


I'm trying to plot the CV score for recursive feature selection for a wine quality data set. I can't work out why my plot ends up splitting out into 5 lines... My expectation is that it will be 1 line showing the change in CV score for a number of selected features.

rfe.grid_scores_ comes out as a list of lists... I've no idea why its doing this. All examples I have seen plotting grid_scores like this show the plot as one line without any indexing of the list...


dataset_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
df = pd.read_csv(dataset_url)

from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler

# define dataset
x = df.iloc[:, 0:-1]
y = df.quality

# create pipeline
rfe = RFECV(estimator=DecisionTreeClassifier())
model = RandomForestRegressor()
pipeline = Pipeline(steps=[('s',rfe),('m',model)])
pipeline.fit(x,y)
# evaluate model
cv = RepeatedKFold(n_splits=10)
n_scores = cross_val_score(pipeline, x, y, cv=cv, n_jobs=1, error_score='raise', scoring = 'neg_mean_squared_error')
    
plt.figure(figsize=(12,6))
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score")
plt.plot(range(1, len(rfe.grid_scores_) + 1), rfe.grid_scores_)
plt.show()

enter image description here


Solution

  • You're plotting five lines because rfe.grid_scores_ returns an array of (number of features, scores per fold) or (11,5). By default, RFECV uses a stratified, 5-fold split. Each line in your plot shows the scores across the number of features for each of the 5 folds.

    If you want to use the specified folds and method, you'll need to pass those to the RFECV.

    dataset_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
    df = pd.read_csv(dataset_url, sep = ';')
    
    # define dataset
    x = df.iloc[:, 0:-1]
    y = df.quality
    
    ### Define the cross-validation strategy here so that it is the same in all parts of your pipeline
    # create pipeline
    cv = RepeatedKFold(n_splits=10) # 10-folds specified by you, 10 repeats (default)
    
    rfe = RFECV(estimator=DecisionTreeClassifier(), cv=cv)
    

    The resulting plot will now plot 100 lines over 11 features, (10 folds times 10 repeats).

    enter image description here

    Notice also that these are the grid scores, not your cross-validation scores. You'll need to plot n_scoresto view the actual cross-validation scores.