I just learned about cross-validation, and when I pass in different arguments, I get different results.
This is the code for building the regression model, and the R-squared output was about 0.5:
from sklearn.datasets import load_boston  # note: load_boston was removed in scikit-learn 1.2
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import numpy as np

boston = load_boston()
X = boston.data
y = boston.target
X_rooms = X[:, 5]  # single feature: average number of rooms (RM)

X_train, X_test, y_train, y_test = train_test_split(X_rooms, y)

reg = LinearRegression()
reg.fit(X_train.reshape(-1, 1), y_train)

prediction_space = np.linspace(min(X_rooms), max(X_rooms)).reshape(-1, 1)
plt.scatter(X_test, y_test)
plt.plot(prediction_space, reg.predict(prediction_space), color='black')
plt.show()

reg.score(X_test.reshape(-1, 1), y_test)
Now when I run cross-validation on X_test, X_train, and X (respectively), it shows different R-squared values.
Here is the call with X_test and y_test as the arguments:
from sklearn.model_selection import cross_val_score
cv = cross_val_score(reg, X_test.reshape(-1, 1), y_test, cv=8)
cv
The result:
array([ 0.42082715, 0.6507651 , -3.35208835, 0.6959869 , 0.7770039 ,
0.59771158, 0.53494622, -0.03308137])
Now when I use X_train and y_train as the arguments, I get different results.
from sklearn.model_selection import cross_val_score
cv = cross_val_score(reg, X_train.reshape(-1, 1), y_train, cv=8)
cv
The result:
array([0.46500321, 0.27860944, 0.02537985, 0.72248968, 0.3166983 ,
0.51262191, 0.53049663, 0.60138472])
Now, when I pass in different arguments yet again, this time X (which in my case is X_rooms) and y, I once more get different R-squared values.
from sklearn.model_selection import cross_val_score
cv = cross_val_score(reg, X_rooms.reshape(-1, 1), y, cv=8)
cv
The output:
array([ 0.61748403, 0.79811218, 0.61559222, 0.6475456 , 0.61468198,
-0.7458466 , -3.71140488, -1.17174927])
Which one should I use?
I know this is a long question, so thanks!
The training set should be used strictly for training your model, while the test set is for the final evaluation. Unfortunately, though, you often need to evaluate your model's score on some set before that final check on the test set, for example when you try to tune hyper-parameters. That is one of the reasons to use cross-validation; there are others.
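For instance, here is a minimal sketch of that tuning workflow, assuming (purely for illustration) that you wanted to compare a few Ridge regularization strengths; the alpha values below are made up:

from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Try each candidate alpha using only the training data;
# the test set stays untouched until the very end.
for alpha in [0.1, 1.0, 10.0]:  # hypothetical candidate values
    scores = cross_val_score(Ridge(alpha=alpha), X_train.reshape(-1, 1), y_train, cv=8)
    print(alpha, scores.mean())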
Usually the process works like this: scikit-learn's cross_val_score receives an estimator (before training!) and the data. Each time, it trains the model on a different section of the data and then returns the score; it's like running a lot of "train-test" checks.
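If it helps to see the idea, here is a rough hand-written equivalent using KFold; this is a sketch of the concept, not what cross_val_score does internally line for line:

from sklearn.base import clone
from sklearn.model_selection import KFold

scores = []
for train_idx, val_idx in KFold(n_splits=8).split(X_train):
    fold_model = clone(reg)  # fresh, untrained copy of the estimator for each fold
    fold_model.fit(X_train[train_idx].reshape(-1, 1), y_train[train_idx])
    scores.append(fold_model.score(X_train[val_idx].reshape(-1, 1), y_train[val_idx]))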
Therefore, you should run:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

reg = LinearRegression()
cv = cross_val_score(reg, X_train.reshape(-1, 1), y_train, cv=8)
solely on the train set. The test set should be reserved for other purposes (the final evaluation).
What you get is an array of R-squared scores (R-squared, not accuracy, is the default scoring for regressors). From it you can check whether your model is stable (are the scores in roughly the same range across all folds?) and its general performance (the average score).
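For example, here is a quick way to read that array, followed by the one final check on the held-out test set (nothing here beyond NumPy array methods and the estimator you already have):

# Summarize the cross-validation scores
print(cv.mean())  # general performance across folds
print(cv.std())   # stability: a large spread means the score depends heavily on the fold

# Only once you are done tuning: the final evaluation on the test set
reg.fit(X_train.reshape(-1, 1), y_train)
print(reg.score(X_test.reshape(-1, 1), y_test))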