I’m working on a regression problem and evaluating its performance. I’m wondering how my R-squared values can be so different from my cross-validation scores. Is this a sign of overfitting? Here’s an example of my setup; X and Y are predefined as the features and the target, respectively.
# assumed imports (not shown in the original); mse and rmse are aliased/helper names
import numpy as np
from sklearn import neighbors
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_absolute_error, mean_squared_error as mse

def rmse(y_true, y_pred):
    return np.sqrt(mse(y_true, y_pred))

# hold out 20% of the data, then fit a distance-weighted k-NN regressor with k = 40
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=47)
knn = neighbors.KNeighborsRegressor(n_neighbors=40, weights='distance')
knn.fit(X_train, y_train)
y_preds_train = knn.predict(X_train)
y_preds_test = knn.predict(X_test)

print('R square of training set:', knn.score(X_train, y_train))
print('_____________________Test Stats_____________________')
print('R square of test in the model:', knn.score(X_test, y_test))
print('MAE:', mean_absolute_error(y_test, y_preds_test))
print('MSE:', mse(y_test, y_preds_test))
print('RMSE:', rmse(y_test, y_preds_test))
print('MAPE:', np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100)

# note: for a regressor, cross_val_score reports R² per fold by default, not accuracy
score = cross_val_score(knn, X, Y, cv=5)
print("Cross Val Accuracy: %0.2f (+/- %0.2f)" % (score.mean(), score.std() * 2))
score
Results:
R square of training set: 0.9881595480397585
_____________________Test Stats_____________________
R square of test in the model: 0.8611300681864155
MAE: 7.488081625869961
MSE: 164.64697808634588
RMSE: 12.831483861438079
MAPE: 368.35904890846416
Cross Val Accuracy: 0.65 (+/- 0.21)
array([0.58122339, 0.53346581, 0.8312428 , 0.69213113, 0.61482638])
While it is difficult to say for certain from these results alone, there are a few things you may want to check.
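For example, one thing worth checking is how the cross-validation folds are built. With cv=5 on a regression target, cross_val_score splits the data with an unshuffled KFold, while train_test_split shuffles the rows, so any ordering in the dataset can make the two numbers diverge. The lines below are a minimal sketch of that check, assuming X, Y, and knn are defined as in your snippet:

from sklearn.model_selection import KFold, cross_val_score

# shuffle the rows before splitting into folds, mirroring what train_test_split does
shuffled_cv = KFold(n_splits=5, shuffle=True, random_state=47)
shuffled_scores = cross_val_score(knn, X, Y, cv=shuffled_cv)  # R² per fold
print("Shuffled CV R²: %0.2f (+/- %0.2f)" % (shuffled_scores.mean(), shuffled_scores.std() * 2))

If the shuffled fold scores move much closer to your test-set R², the gap says more about how the folds were drawn than about overfitting.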