python, machine-learning, regression, loss-function

Unexpected R^2 loss value in cross_val_score


I am working with a regression dataset, and I want to fit a suitable model after evaluating several models' performance. I used cross_val_score from sklearn.model_selection for this purpose. With the scoring parameter set to 'r2', I got highly negative values for some of my models.

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import (LinearRegression, Ridge, Lasso, ElasticNet,
                                  Lars, LassoLars, OrthogonalMatchingPursuit,
                                  BayesianRidge, HuberRegressor,
                                  RANSACRegressor, SGDRegressor)
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import (RandomForestRegressor, ExtraTreesRegressor,
                              AdaBoostRegressor, GradientBoostingRegressor)
from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import SVR, NuSVR, LinearSVR

demo = pd.read_csv('demo.csv')
X_train = demo.iloc[0:1460, :]   # first 1460 rows are the training features
Y_train = pd.read_csv('train.csv').loc[:, 'SalePrice':'SalePrice']
X_test = demo.iloc[1460:, :]     # remaining rows are the test features

regressors = []
regressors.append(LinearRegression())
regressors.append(Ridge())
regressors.append(Lasso())
regressors.append(ElasticNet())
regressors.append(Lars())
regressors.append(LassoLars())
regressors.append(OrthogonalMatchingPursuit())
regressors.append(BayesianRidge())
regressors.append(HuberRegressor())
regressors.append(RANSACRegressor())
regressors.append(SGDRegressor())
regressors.append(GaussianProcessRegressor())
regressors.append(DecisionTreeRegressor())
regressors.append(RandomForestRegressor())
regressors.append(ExtraTreesRegressor())
regressors.append(AdaBoostRegressor())
regressors.append(GradientBoostingRegressor())
regressors.append(KernelRidge())
regressors.append(SVR())
regressors.append(NuSVR())
regressors.append(LinearSVR())

cv_results = []
for regressor in regressors:
    # 10-fold CV: each call returns an array of 10 r2 scores, one per fold
    cv_results.append(cross_val_score(regressor, X=X_train, y=Y_train,
                                      scoring='r2', verbose=True, cv=10))

After the above code is run, cv_results is a list of float64 arrays; each array contains 10 'r2' values (one per fold, since cv = 10).
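
For reference, here is a minimal sketch for summarising cv_results at a glance (my own addition, reusing the regressors and cv_results lists from above):

for regressor, scores in zip(regressors, cv_results):
    # one line per model: class name, mean and spread of the 10 fold scores
    name = type(regressor).__name__
    print(f'{name:30s} mean r2 = {scores.mean():9.3f}  std = {scores.std():.3f}')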

I open the first array and notice that for this particular model, some of the 'r2' values are extremely negative.

Since I expected 'r2' values to lie between 0 and 1, why are there such large negative values?


Solution

  • Here's the thing: R^2 values don't actually need to be in [0, 1].

    Essentially, R^2 has a baseline of 0: a score of 0 means your model does no better and no worse than simply predicting the mean of the response variable. Formally, R^2 = 1 - SS_res / SS_tot, where SS_tot is the squared error of that mean-only baseline, so R^2 becomes negative, with no lower bound, whenever your model's squared error SS_res exceeds SS_tot. In OLS with an intercept term, the fit can never do worse than the mean on the training data, which is why R^2 lands in [0, 1] there.

    However, for other models this is not true in general; for instance, if you fix the intercept in a linear regression model, you can end up doing far worse than just predicting the mean of your response. The same mechanism applies in cross-validation: each fold's score is computed on held-out data, where a badly fitting model can easily predict worse than that fold's mean, which is exactly what produces the large negative values you are seeing. The sketch below makes this concrete.
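
    Here is a minimal sketch (my own toy numbers, not the question's data) showing both effects with r2_score and a no-intercept LinearRegression:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score

    y_true = np.array([10.0, 20.0, 30.0, 40.0])

    # Predicting the mean of y_true is the baseline: R^2 is exactly 0.
    print(r2_score(y_true, np.full_like(y_true, y_true.mean())))  # 0.0

    # Anything worse than the mean goes negative, with no lower bound.
    print(r2_score(y_true, np.full_like(y_true, 100.0)))  # -45.0

    # Forcing the fit through the origin on strongly offset data gives a
    # very negative R^2, even for ordinary linear regression.
    X = np.array([[1.0], [2.0], [3.0], [4.0]])
    y = y_true + 1000.0
    model = LinearRegression(fit_intercept=False).fit(X, y)
    print(model.score(X, y))  # large negative value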