After validating the performance of my regression model with cross_validate, I obtain the following results using the 'r2' scoring.
This is what my code is doing:
scores = cross_validate(RandomForestRegressor(),X,y,cv=5,scoring='r2')
and what I get (in scores['test_score']) is
array([0.47146303, 0.47492019, 0.49350646, 0.56479323, 0.56897343])
For more flexibility, I've also written my own cross-validation function, which is the following:
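import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

def my_cross_val(estimator, X, y):
    r2_scores = []
    kf = KFold(shuffle=True)
    for train_index, test_index in kf.split(X, y):
        # Fit on the training fold, then score on the held-out fold
        estimator.fit(X.iloc[train_index].values, y.iloc[train_index].values)
        preds = estimator.predict(X.iloc[test_index].values)
        r2 = r2_score(y.iloc[test_index].values, preds)
        r2_scores.append(r2)
    return np.array(r2_scores)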
Running now
scores = my_cross_val(RandomForestRegressor(),X,y)
I obtain
array([0.6975932 , 0.68211856, 0.62892119, 0.64776752, 0.66046326])
Am I doing something wrong in my_cross_val(), as the values seem quite overestimated compared to cross_validate()? Maybe it is putting shuffle=True inside KFold?
To be sure that you are comparing apples to apples, and given that shuffling can make a huge difference in such cases, here is what you should do:
First, shuffle your data manually:
from sklearn.utils import shuffle
X_s, y_s = shuffle(X, y, random_state=42)
Then, run cross_validate with this shuffled data:
scores = cross_validate(RandomForestRegressor(),X_s, y_s, cv=5, scoring='r2')
Change your function to use
kf = KFold(shuffle=False) # no more shuffling (although it should not hurt)
and run it with the already shuffled data:
scores = my_cross_val(RandomForestRegressor(), X_s, y_s)
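The steps so far can be tried end to end with a minimal sketch. The synthetic X and y below are assumptions standing in for the real data (the rows are deliberately sorted by one feature, so that fold order matters), and my_cross_val is the function from the question with shuffle=False.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_validate
from sklearn.utils import shuffle

# Synthetic stand-in for X and y (an assumption): rows are sorted by x0,
# so contiguous, unshuffled folds each cover a different range of x0.
rng = np.random.RandomState(0)
x0 = np.sort(rng.uniform(0, 10, size=300))
X = pd.DataFrame({"x0": x0, "x1": rng.normal(size=300)})
y = pd.Series(np.sin(x0) + 0.1 * rng.normal(size=300))


def my_cross_val(estimator, X, y):
    r2_scores = []
    kf = KFold(n_splits=5, shuffle=False)  # the data is shuffled beforehand
    for train_index, test_index in kf.split(X, y):
        estimator.fit(X.iloc[train_index].values, y.iloc[train_index].values)
        preds = estimator.predict(X.iloc[test_index].values)
        r2_scores.append(r2_score(y.iloc[test_index].values, preds))
    return np.array(r2_scores)


# Baseline: cv=5 on the ordered data uses contiguous folds.
ordered = cross_validate(
    RandomForestRegressor(), X, y, cv=5, scoring="r2"
)["test_score"]

# Shuffle once, then give both approaches the same shuffled data.
X_s, y_s = shuffle(X, y, random_state=42)
builtin = cross_validate(
    RandomForestRegressor(), X_s, y_s, cv=5, scoring="r2"
)["test_score"]
manual = my_cross_val(RandomForestRegressor(), X_s, y_s)

print("ordered data, cross_validate: ", ordered.round(3))
print("shuffled data, cross_validate:", builtin.round(3))
print("shuffled data, my_cross_val:  ", manual.round(3))
```

On data like this, the two shuffled runs should land close to each other, while the run on the ordered data scores noticeably lower; how large the gap is on real data depends on how strongly the rows are ordered.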
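Now the results should be similar, but not yet identical. You can make them identical if you define kf = KFold(n_splits=5, shuffle=False) beforehand (outside the function), and run cross_validate as
scores = cross_validate(RandomForestRegressor(), X_s, y_s, cv=kf, scoring='r2') # cv=kf
i.e. using the exact same CV partition in both cases (use the same kf definition inside the function). Note that random_state only has an effect in KFold when shuffle=True, and recent scikit-learn versions raise a ValueError if it is set together with shuffle=False. Since RandomForestRegressor is itself randomized, you also need to pass it the same random_state in both runs, e.g. RandomForestRegressor(random_state=0), to get exactly equal scores.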
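As a rough check of this last point, the sketch below shares one KFold object between cross_validate and the manual loop and fixes the forest's seed. The data from make_regression is a placeholder, and passing kf as an extra argument to my_cross_val is a variation introduced here for illustration, not the signature from the question.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_validate
from sklearn.utils import shuffle

# Placeholder data (an assumption), wrapped in pandas objects so that
# .iloc indexing works as in the question.
X_arr, y_arr = make_regression(n_samples=300, n_features=4, noise=10.0, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(4)])
y = pd.Series(y_arr)
X_s, y_s = shuffle(X, y, random_state=42)

# One splitter shared by both approaches: identical partitions.
kf = KFold(n_splits=5, shuffle=False)


def my_cross_val(estimator, X, y, kf):
    # Hypothetical variant of the question's function that takes the splitter.
    r2_scores = []
    for train_index, test_index in kf.split(X, y):
        estimator.fit(X.iloc[train_index].values, y.iloc[train_index].values)
        preds = estimator.predict(X.iloc[test_index].values)
        r2_scores.append(r2_score(y.iloc[test_index].values, preds))
    return np.array(r2_scores)


# The same integer seed makes every fit of the forest reproducible.
scores_builtin = cross_validate(
    RandomForestRegressor(random_state=0), X_s, y_s, cv=kf, scoring="r2"
)["test_score"]
scores_manual = my_cross_val(RandomForestRegressor(random_state=0), X_s, y_s, kf)

print(scores_builtin)
print(scores_manual)
print("identical:", np.allclose(scores_builtin, scores_manual))
```

With the partition and the estimator seed both fixed, the only remaining difference between the two code paths is bookkeeping, so the per-fold scores should agree.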