I'm trying to do cross-validation (specifically LOOCV) on a simple linear regression model, but when I calculate the scores I get nan for every entry. Does anyone know why?
Here is the code:
import numpy as np
from sklearn import model_selection
from sklearn.linear_model import LinearRegression
# auto is the pandas DataFrame shown below
lr = LinearRegression()
X = np.array(auto['horsepower']).reshape(-1,1)
y = np.array(auto['mpg']).reshape(-1,1)
# cv=len(X) creates one fold per row, i.e. leave-one-out cross-validation
cv = model_selection.cross_val_score(lr,X,y,cv=len(X))
Here is the data:
mpg cylinders displacement horsepower weight acceleration year origin name
0 18.0 8 307.0 130 3504 12.0 70 1 chevrolet chevelle malibu
1 15.0 8 350.0 165 3693 11.5 70 1 buick skylark 320
2 18.0 8 318.0 150 3436 11.0 70 1 plymouth satellite
3 16.0 8 304.0 150 3433 12.0 70 1 amc rebel sst
4 17.0 8 302.0 140 3449 10.5 70 1 ford torino
... ... ... ... ... ... ... ... ... ...
387 27.0 4 140.0 86 2790 15.6 82 1 ford mustang gl
388 44.0 4 97.0 52 2130 24.6 82 2 vw pickup
389 32.0 4 135.0 84 2295 11.6 82 1 dodge rampage
390 28.0 4 120.0 79 2625 18.6 82 1 ford ranger
391 31.0 4 119.0 82 2720 19.4 82 1 chevy s-10
392 rows × 9 columns
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 392 entries, 0 to 391
Data columns (total 9 columns):
mpg 392 non-null float64
cylinders 392 non-null int64
displacement 392 non-null float64
horsepower 392 non-null int64
weight 392 non-null int64
acceleration 392 non-null float64
year 392 non-null int64
origin 392 non-null int64
name 392 non-null object
dtypes: float64(3), int64(5), object(1)
memory usage: 27.7+ KB
If you read the documentation of cross_val_score:
scoring: string, callable, list/tuple, dict or None, default: None [....] If None, the estimator’s score method is used.
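For LinearRegression() this is the R^2 of the prediction. But R^2 is undefined when a test fold contains a single sample (n=1), which is exactly the LOOCV case: the total sum of squares around the fold's mean is zero, so every fold scores nan. Try something like the mean squared error instead. Below I used 'neg_mean_squared_error', which is the negative of the MSE. The valid scorer names are listed by sklearn.metrics.SCORERS.keys() (or by sklearn.metrics.get_scorer_names() in newer scikit-learn releases, where SCORERS has been removed).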
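To see why a single held-out sample breaks R^2, here is a minimal sketch that computes R^2 = 1 - SS_res / SS_tot by hand. The helper `r_squared` and the predicted values are made up for illustration; this is not scikit-learn's implementation, and it simply reports nan whenever SS_tot is zero.

```python
import math

def r_squared(y_true, y_pred):
    """Hand-rolled R^2 = 1 - SS_res / SS_tot (illustrative helper, not sklearn's)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        # With one test sample, y equals its own mean, so SS_tot is 0
        # and the ratio is undefined; report nan.
        return math.nan
    return 1 - ss_res / ss_tot

# One held-out sample, as in every LOOCV fold (prediction value is made up).
single_fold = r_squared([18.0], [17.2])

# Three held-out samples: SS_tot is positive, so R^2 is well defined.
larger_fold = r_squared([18.0, 15.0, 16.0], [17.0, 15.5, 16.5])

print(single_fold, larger_fold)
```

The squared error of a single prediction, by contrast, is always defined, which is why an MSE-based scorer works with one sample per fold.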
import numpy as np
import pandas as pd
from sklearn import model_selection
from sklearn.linear_model import LinearRegression
auto = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data",
                   delimiter=r"\s+",header=None,
                   names=["mpg","cylinders","displacement","horsepower","weight",
                          "acceleration","model year","origin","car name"],
                   na_values=['?'])
# the raw file marks missing horsepower values with '?'; drop those rows,
# otherwise LinearRegression fails on the resulting NaNs
auto = auto.dropna(subset=['horsepower'])
lr = LinearRegression()
X = np.array(auto['horsepower']).reshape(-1,1)
y = np.array(auto['mpg']).reshape(-1,1)
# one score per held-out row: minus the squared prediction error for that row
model_selection.cross_val_score(lr,X,y,cv=len(X),scoring='neg_mean_squared_error')
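For intuition about what those scores mean, here is a rough pure-Python sketch of leave-one-out with a negated squared-error score, using a few horsepower/mpg pairs from the table in the question. The helpers `fit_line` and `loocv_neg_mse` are hypothetical names written for this illustration; they mimic the idea, not scikit-learn's internals.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def loocv_neg_mse(xs, ys):
    """Leave each point out once, fit on the rest, score the held-out point."""
    scores = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        intercept, slope = fit_line(train_x, train_y)
        prediction = intercept + slope * xs[i]
        # With one test sample, the fold's MSE is just one squared error;
        # it is negated so that "higher is better", as sklearn scorers expect.
        scores.append(-(ys[i] - prediction) ** 2)
    return scores

# Eight rows taken from the data shown in the question.
horsepower = [130.0, 165.0, 150.0, 140.0, 86.0, 52.0, 84.0, 79.0]
mpg = [18.0, 15.0, 18.0, 17.0, 27.0, 44.0, 32.0, 28.0]

scores = loocv_neg_mse(horsepower, mpg)
loocv_mse = -sum(scores) / len(scores)  # flip the sign back to get the LOOCV MSE
print(scores)
print(loocv_mse)
```

The same sign flip applies to the array returned by cross_val_score: negating its mean gives the leave-one-out estimate of the test MSE.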