Tags: scikit-learn, regression, nan, grid-search, kaggle

"Input contains NaN, infinity or a value too large" when using GridSearchCV with scoring = 'neg_mean_squared_log_error'


I was working on the Kaggle data set for the 'Santander Value Prediction Challenge':

from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

lasso = Lasso()
lasso_para = {'alpha': [0.001, 0.01, 0.02]}
gs = GridSearchCV(estimator=lasso,
                  param_grid=lasso_para,
                  cv=10,
                  scoring='neg_mean_squared_log_error',
                  verbose=2)
gs.fit(train, df_y)

An error is raised when I try to use GridSearchCV to fit the training set:

File "C:\Users\HP\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 44, in _assert_all_finite
    " or a value too large for %r." % X.dtype)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

All columns are float64:

train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4459 entries, 0 to 4458
Columns: 1894 entries, 0 to 1893
dtypes: float64(1894)
memory usage: 64.4 MB

df_y.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4459 entries, 0 to 4458
Data columns (total 1 columns):
target    4459 non-null float64
dtypes: float64(1)
memory usage: 34.9 KB

• I checked both the training set and the y values using sum(dataset.isnull().sum()); both outputs are 0.

sum(train.isnull().sum())
Out[46]: 0

sum(df_y.isnull().sum())
Out[47]: 0

• The error only occurs when I set scoring = 'neg_mean_squared_log_error'; everything works fine when using MSE.

• No errors are found if I fit the entire training set without k-fold cross-validation.

lasso.fit(train,df_y)
Out[48]: 
Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)

• All predicted values are positive when I call predict on the entire training set.

y_pred_las = lasso.predict(train)
min(y_pred_las)
Out[50]: 26.871339344757036
np.isnan(y_pred_las).any()
Out[59]: False

• The error is only raised with linear regressors such as Lasso, Ridge, and ElasticNet.

• No errors occur with tree-based regressors such as XGBoost and LightGBM.

• My training set has about 4600 rows with 1900 variables after applying PCA. When I fit GridSearchCV separately on variables 1 to 500, 500 to 1000, 1000 to 1500, and 1500 to 1900, no errors occur.
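One thing worth double-checking: sum(dataset.isnull().sum()) only counts NaN values, not infinities, so the checks above can pass even if the data contains inf. A small sketch (on a toy frame, not the competition data) showing how to test for infinite values separately:

```python
import numpy as np
import pandas as pd

# Toy illustration: isnull() does not flag infinite values, so a frame
# can pass the isnull check and still trip sklearn's
# "Input contains NaN, infinity or a value too large" validation.
df = pd.DataFrame({'a': [1.0, np.inf, 3.0]})

print(int(df.isnull().sum().sum()))     # 0  -- inf is not "null"
print(bool(np.isinf(df.values).any()))  # True -- but inf is present
```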

After all these trials I was still unable to find the reason for the error. Has anyone run into a similar situation before and knows why?

Hopefully a kind soul could help me out!

Cheers!


Solution

  • You can solve this error by adding one line. I am also a Kaggler and faced a similar problem.

    The error is only raised with linear regressors such as Lasso, Ridge, and ElasticNet, not with tree-based regressors such as XGBoost and LightGBM, because LightGBM and XGBoost handle missing values by themselves. The scikit-learn linear models do not handle missing values, so we have to do some pre-processing first.

    Your dataset may contain null, missing, or inf values, so we have to fill the missing values and clip the infinite values to some range.

    Adding this line before fitting the scikit-learn model solves the issue:

    df = df.fillna(df.median()).clip(-1e11, 1e11)
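Applied to the variable names used in the question (a toy stand-in below; the real train comes from the competition data), the pre-processing step would look something like:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the question's training frame; the real `train`
# is the Santander competition data after PCA.
train = pd.DataFrame({'f0': [1.0, np.nan, 3.0],
                      'f1': [np.inf, 2.0, -np.inf]})

# Fill NaNs with each column's median and clip infinities to a finite
# range, as the solution suggests, before handing the frame to sklearn.
train = train.fillna(train.median()).clip(-1e11, 1e11)

# The frame is now entirely finite, so sklearn's validation passes.
print(bool(np.isfinite(train.values).all()))  # True
```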