Tags: python, scikit-learn, linear-regression

Evaluate Polynomial regression using cross_val_score


I am trying to use cross_val_score to evaluate my regression model (with PolynomialFeatures(degree=2)). As I understood from various blog posts, I should use cross_val_score with the original X and y values, not with X_train and y_train.

r_squareds = cross_val_score(pipe, X, y, cv=10)
r_squareds
>>> array([ 0.74285583,  0.78710331, -1.67690578,  0.68890253,  0.63120873,
    0.74753825,  0.13937611,  0.18794756, -0.12916661,  0.29576638])

which indicates my model doesn't perform very well, with a mean R² of only 0.241. Is this the correct interpretation?
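
For reference, the 0.241 comes from averaging the fold scores shown above, and their spread is large:

r_squareds.mean()   # ≈ 0.241
r_squareds.std()    # ≈ 0.7 -- the scores vary a lot between folds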

However, I came across a Kaggle notebook working on the same data where the author ran cross_val_score on X_train and y_train. I gave this a try and the average R² was better.

r_squareds = cross_val_score(pipe, X_train, y_train, cv=10)
r_squareds.mean()
>>> 0.673

Is this a problem?

Here is the code for my model:

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# df holds the housing data (loaded earlier, not shown)
X = df[['CHAS', 'RM', 'LSTAT']]
y = df['MEDV']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

pipe = Pipeline(
    steps=[('poly_feature', PolynomialFeatures(degree=2)),
           ('model', LinearRegression())]
)

# fit the model
pipe.fit(X_train, y_train)

Solution

  • Your first interpretation is correct. The first cross_val_score trains 10 models, each using 90% of your data for training and the remaining 10% as a validation set. We can see from these results that the estimator's R² variance is quite high; in some folds the model even performs worse than a constant prediction of the mean (a negative R²).

    From these results we can safely say that the model is not performing well on this dataset.

    It is possible that the score obtained by running cross_val_score on only the training set is higher, but that score is most likely not representative of your model's performance, since the data it sees may be too small to capture all of the dataset's variance. (Each training fold in the second cross_val_score covers only 54% of your dataset: 90% of the 60% left after the train/test split.) The sketch below makes the sizes concrete.
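
As a rough illustration (assuming df is the 506-row Boston housing data, which the CHAS/RM/LSTAT/MEDV columns suggest but the question does not state), this sketch shows what cross_val_score(..., cv=10) does for a regressor and how many rows each fold actually trains on in the two setups:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import r2_score

# Roughly what cross_val_score(pipe, X, y, cv=10) does for a regressor:
# fit 10 copies of the pipeline, each on 90% of the rows passed in,
# and compute R² on the remaining 10%.
def manual_cv_r2(estimator, X, y, n_splits=10):
    scores = []
    for train_idx, val_idx in KFold(n_splits=n_splits).split(X):
        estimator.fit(X.iloc[train_idx], y.iloc[train_idx])
        scores.append(r2_score(y.iloc[val_idx], estimator.predict(X.iloc[val_idx])))
    return np.array(scores)

# Effective training size per fold (506 rows is an assumption):
n = 506
print(0.9 * n)          # ~455 rows per fold when cross-validating on X, y
print(0.9 * 0.6 * n)    # ~273 rows per fold when cross-validating on X_train,
                        # i.e. each model sees only about 54% of the dataset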