python, machine-learning, cross-validation

Why is my average cross_val_score so different from the R-squared values on my training and test sets?


I’m working on a regression problem and evaluating the model's performance. I’m wondering how my R-squared values can be so different from my cross-validation scores. Is this a sign of overfitting? Here's an example of my setup. X and Y are predefined as the features and the target, respectively.

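import numpy as np
from sklearn import neighbors
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score, train_test_split
from statsmodels.tools.eval_measures import mse, rmse  # assumed source of mse and rmse

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=.2, random_state=47)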
knn = neighbors.KNeighborsRegressor(n_neighbors=40, weights='distance')

knn.fit(X_train, y_train)

y_preds_train = knn.predict(X_train)
y_preds_test = knn.predict(X_test)

print('R square of training set:', knn.score(X_train, y_train))
print('_____________________Test Stats_____________________')
print('R square of test in the model:', knn.score(X_test, y_test))
print('MAE:', mean_absolute_error(y_test, y_preds_test))
print('MSE:', mse(y_test, y_preds_test))
print('RMSE:', rmse(y_test, y_preds_test))
print('MAPE:', np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100)

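# With no scoring argument, cross_val_score uses the estimator's own score method (R^2 for regressors)
score = cross_val_score(knn, X, Y, cv=5)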
print("Cross Val Accuracy: %0.2f (+/- %0.2f)" % (score.mean(), score.std()*2))
score

Results:

R square of training set: 0.9881595480397585
_____________________Test Stats_____________________
R square of test in the model: 0.8611300681864155
MAE: 7.488081625869961
MSE: 164.64697808634588
RMSE: 12.831483861438079
MAPE: 368.35904890846416
Cross Val Accuracy: 0.65 (+/- 0.21)
array([0.58122339, 0.53346581, 0.8312428 , 0.69213113, 0.61482638])

Solution

  • It is difficult to diagnose the cause from these results alone, but there are a few things worth checking:

    1. Vary n_neighbors over a range and check how the cross-validation score changes, or run a GridSearchCV.
    2. Change cv in cross_val_score to 10: one fold reached a score of 0.83 while the rest are below 0.7, which is surprising. Set test_size=0.1 as well when you do this.
    3. Ensure that the data in X is normalized.
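
The three checks above can be combined into one rough sketch. It is not taken from the original post: the synthetic data from make_regression is only a stand-in for the asker's X and Y, the candidate values for n_neighbors are illustrative, and StandardScaler is just one possible way to normalize the features.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the question's X and Y (an assumption; use the real data here).
X, Y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=47)

# Keep scaling inside the pipeline so every CV fold is scaled
# using statistics from its own training portion only.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsRegressor(weights="distance")),
])

# Illustrative candidate values for n_neighbors.
param_grid = {"knn__n_neighbors": [5, 10, 20, 40]}

# 10-fold cross-validation, scored with R^2 like the question's cross_val_score call.
search = GridSearchCV(pipe, param_grid, cv=10, scoring="r2")
search.fit(X, Y)

best_k = search.best_params_["knn__n_neighbors"]
best_mean = search.best_score_
best_std = search.cv_results_["std_test_score"][search.best_index_]

print("Best n_neighbors:", best_k)
print("Mean CV R^2: %0.2f (+/- %0.2f)" % (best_mean, best_std * 2))
```

If the mean score stays well below the single train/test split's R-squared even after tuning and scaling, it may be worth looking at the per-fold scores in search.cv_results_ to see whether one or two folds account for the gap.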