Tags: python, pandas, scikit-learn, statistics, cross-validation

Why is my cross-validation score array returning NaN in sklearn?


I'm trying to do cross-validation (specifically LOOCV) on a simple linear regression model, but for some reason when calculating the score of the process I'm getting NaN for all the entries. Does anyone know why?

Here is the code:

# use sklearn to cross-validate a linear regression
import numpy as np
from sklearn import model_selection
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

lr = LinearRegression()

X = np.array(auto['horsepower']).reshape(-1, 1)
y = np.array(auto['mpg']).reshape(-1, 1)

# cv=len(X) makes each fold a single observation, i.e. LOOCV
cv = model_selection.cross_val_score(lr, X, y, cv=len(X))

Here is the data:

mpg cylinders   displacement    horsepower  weight  acceleration    year    origin  name
0   18.0    8   307.0   130 3504    12.0    70  1   chevrolet chevelle malibu
1   15.0    8   350.0   165 3693    11.5    70  1   buick skylark 320
2   18.0    8   318.0   150 3436    11.0    70  1   plymouth satellite
3   16.0    8   304.0   150 3433    12.0    70  1   amc rebel sst
4   17.0    8   302.0   140 3449    10.5    70  1   ford torino
... ... ... ... ... ... ... ... ... ...
387 27.0    4   140.0   86  2790    15.6    82  1   ford mustang gl
388 44.0    4   97.0    52  2130    24.6    82  2   vw pickup
389 32.0    4   135.0   84  2295    11.6    82  1   dodge rampage
390 28.0    4   120.0   79  2625    18.6    82  1   ford ranger
391 31.0    4   119.0   82  2720    19.4    82  1   chevy s-10
392 rows × 9 columns



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 392 entries, 0 to 391
Data columns (total 9 columns):
mpg             392 non-null float64
cylinders       392 non-null int64
displacement    392 non-null float64
horsepower      392 non-null int64
weight          392 non-null int64
acceleration    392 non-null float64
year            392 non-null int64
origin          392 non-null int64
name            392 non-null object
dtypes: float64(3), int64(5), object(1)
memory usage: 27.7+ KB


Solution

  • If you read the documentation of cross_val_score:

    scoring: string, callable, list/tuple, dict or None, default: None [....] If None, the estimator’s score method is used.

    For LinearRegression() this is the R^2 of the prediction. But R^2 is undefined when the test set contains a single observation (n = 1), which is exactly what happens in every fold of LOOCV, so each fold's score comes back as NaN. Use a metric that is defined for a single sample instead, such as the mean squared error; below I used 'neg_mean_squared_error', which is the negative of the MSE. The available scorer names are listed by sklearn.metrics.SCORERS.keys().

    import numpy as np
    import pandas as pd
    from sklearn import model_selection
    from sklearn.linear_model import LinearRegression
    
    auto = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data",
                       delimiter=r"\s+", header=None,
                       names=["mpg","cylinders","displacement","horsepower","weight",
                              "acceleration","model year","origin","car name"],
                       na_values=['?'])
    # horsepower contains a few '?' entries that become NaN; drop those rows
    # (392 rows remain, matching the DataFrame shown in the question)
    auto = auto.dropna()
    
    lr = LinearRegression()
    X = np.array(auto['horsepower']).reshape(-1, 1)
    y = np.array(auto['mpg']).reshape(-1, 1)
    
    model_selection.cross_val_score(lr, X, y, cv=len(X), scoring='neg_mean_squared_error')
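If you want a single LOOCV error estimate rather than one score per fold, you can average the negated per-fold scores. A minimal sketch of that, using LeaveOneOut explicitly (equivalent to cv=len(X)) on synthetic horsepower/mpg-like data — the numbers below are made up for illustration, not the real auto-mpg values:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# synthetic stand-in for the auto data (hypothetical values)
rng = np.random.default_rng(0)
X = rng.uniform(45, 230, size=(60, 1))                # "horsepower"
y = 40 - 0.12 * X[:, 0] + rng.normal(0, 2, size=60)   # "mpg" with noise

# LeaveOneOut() spells out the intent of cv=len(X)
scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(),
                         scoring='neg_mean_squared_error')

# the scorer returns *negative* MSE, so flip the sign when aggregating
loocv_mse = -scores.mean()
print(f"LOOCV estimate of test MSE: {loocv_mse:.3f}")
```

Every entry of `scores` is now finite, since MSE is well-defined on a single held-out observation.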