Tags: python, machine-learning, scikit-learn, cross-validation

Different results between cross_validate() and my own cross-validation function


After validating the performance of my regression model with cross_validate, I obtain some scores using the 'r2' scoring metric.

This is what my code does:

scores = cross_validate(RandomForestRegressor(), X, y, cv=5, scoring='r2')

and what I get is

>>> scores['test_score']

array([0.47146303, 0.47492019, 0.49350646, 0.56479323, 0.56897343])

For more flexibility, I've also written my own cross-validation function, which is the following:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import r2_score

def my_cross_val(estimator, X, y):

    r2_scores = []

    kf = KFold(shuffle=True)  # 5 folds by default

    for train_index, test_index in kf.split(X, y):

        # fit on the training fold, score on the held-out fold
        estimator.fit(X.iloc[train_index].values, y.iloc[train_index].values)
        preds = estimator.predict(X.iloc[test_index].values)

        r2 = r2_score(y.iloc[test_index].values, preds)
        r2_scores.append(r2)

    return np.array(r2_scores)

Running now

scores = my_cross_val(RandomForestRegressor(), X, y)

I obtain

array([0.6975932 , 0.68211856, 0.62892119, 0.64776752, 0.66046326])

Am I doing something wrong in

my_cross_val()

as the values seem overestimated compared to cross_validate()? Could it be the shuffle=True inside KFold?


Solution

  • In order to be sure that you are comparing apples to apples, and given that shuffling can make a huge difference in such cases, here is what you should do:

    First, shuffle your data manually:

    from sklearn.utils import shuffle
    X_s, y_s = shuffle(X, y, random_state=42)
    

    Then, run cross_validate with these shuffled data:

    scores = cross_validate(RandomForestRegressor(), X_s, y_s, cv=5, scoring='r2')
    

    Change your function to use

    kf = KFold(shuffle=False)  # no more shuffling needed - the data are already shuffled (re-shuffling should not hurt, though)
    

    and run it with the already shuffled data:

    scores = my_cross_val(RandomForestRegressor(), X_s, y_s)
    

    Now the results should be similar - but not yet identical, since both the fold assignment and the Random Forest itself involve randomness. You can turn them identical if you define a single kf = KFold(shuffle=False) beforehand (and outside of the function), use that same kf inside my_cross_val, and run cross_validate as

    scores = cross_validate(RandomForestRegressor(random_state=0), X_s, y_s, cv=kf, scoring='r2') # cv=kf

    i.e. using the exact same CV partition in both cases; you should also fix the estimator's own random_state (here random_state=0) in both runs, since fitting a Random Forest is itself a randomized procedure. With shuffle=False the KFold partition is already deterministic, so no random_state is needed for kf (recent scikit-learn versions will even raise an error if you set random_state while shuffle=False).
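
    Putting it all together, here is a minimal sketch of the whole comparison. It assumes, as in the question, that X and y are pandas objects, and it adapts my_cross_val to accept the splitter as an argument; the seed values are arbitrary:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import r2_score
    from sklearn.model_selection import KFold, cross_validate
    from sklearn.utils import shuffle

    # Shuffle once, up front (arbitrary seed)
    X_s, y_s = shuffle(X, y, random_state=42)

    # One deterministic partition, shared by both runs
    kf = KFold(n_splits=5, shuffle=False)

    def my_cross_val(estimator, X, y, cv):
        r2_scores = []
        for train_index, test_index in cv.split(X):
            estimator.fit(X.iloc[train_index].values, y.iloc[train_index].values)
            preds = estimator.predict(X.iloc[test_index].values)
            r2_scores.append(r2_score(y.iloc[test_index].values, preds))
        return np.array(r2_scores)

    # Same splits (cv=kf) and same estimator seed in both runs
    scores_cv = cross_validate(RandomForestRegressor(random_state=0), X_s, y_s,
                               cv=kf, scoring='r2')['test_score']
    scores_mine = my_cross_val(RandomForestRegressor(random_state=0), X_s, y_s, cv=kf)

    print(np.allclose(scores_cv, scores_mine))  # expected: True

    Since both runs now share the exact same partition and the same estimator seed, the two score arrays are expected to match.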