python scikit-learn regression cross-validation

Wrong data for cross-validation (doesn't work)


I want to find the best regularization coefficient α for a regression problem using 5-fold cross-validation.

When I run the following simple code, an error is thrown:

alphas = np.logspace(-6, 2, 200)
skf = StratifiedKFold(n_splits=5)
lasso_cv = LassoCV(alphas=alphas, random_state=17, max_iter=5000)

for k, (train, test) in enumerate(skf.split(X_train_scaled, y_train)):
    lasso_cv.fit(X_train_scaled[train], y_train[train])

    print("[fold {0}] alpha: {1:.5f}, score: {2:.5f}".
          format(k, lasso_cv.alpha_, lasso_cv.score(X_train_scaled[test], y_train[test]))
         )

for k, (train, test) in enumerate(skf.split(X_train_scaled, y_train)):
----> 6     lasso_cv.fit(X_train_scaled[train], y_train[train])

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

I essentially rewrote the code from here (very bottom)


I have no NaN or inf values:

where_XNaNs = np.isnan(X_train_scaled)
where_yNaNs = np.isnan(y_train)
print(X_train_scaled[where_XNaNs])
print(y_train[where_yNaNs])

print()

where_Xinfs = np.isinf(X_train_scaled)
where_yinfs = np.isinf(y_train)
print(X_train_scaled[where_Xinfs])
print(y_train[where_yinfs])

[]
Series([], Name: quality, dtype: int64)

[]
Series([], Name: quality, dtype: int64)
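
The same check can be condensed into one boolean per array. This is only a minimal sketch: X_demo and y_demo are small made-up stand-ins for X_train_scaled and y_train, since the real data is not shown here.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the question's X_train_scaled and y_train.
X_demo = np.array([[0.1, -1.2], [0.5, 0.3], [1.7, -0.4]])
y_demo = pd.Series([5, 6, 5], name="quality")

# np.isfinite is False for NaN, +inf and -inf, so one call covers both checks.
X_ok = bool(np.isfinite(X_demo).all())
y_ok = bool(np.isfinite(y_demo.to_numpy()).all())
print(X_ok, y_ok)
```

Note that this only inspects the full arrays, not whatever comes out of indexing them inside the loop.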

Solution

  • The person who helped didn't want to write an answer, so it'll be me.

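    You need to change y_train[train] to y_train.iloc[train] (and likewise y_train[test] to y_train.iloc[test]). y_train is a pandas Series, so plain [] with an array of integers looks values up by index label rather than by position, while the indices produced by skf.split are positions. Labels missing from the Series presumably came back as NaN, which is what LassoCV complained about.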
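
For illustration, here is a minimal sketch of the corrected loop. It assumes y_train is a pandas Series whose index labels no longer run 0..n-1 (for example after a train/test split) and that X_train_scaled is a NumPy array. The data is synthetic, the names X_demo, y_demo and scores are made up for the example, and a shorter alphas grid is used to keep it quick.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(17)

# Synthetic stand-in for X_train_scaled (a plain NumPy array).
X_demo = rng.normal(size=(200, 5))

# Synthetic integer target, loosely like a "quality" column, stored in a
# Series whose labels are shuffled so that labels and positions differ.
y_values = np.clip(
    np.rint(6 + X_demo[:, 0] + 0.3 * rng.normal(size=200)), 4, 8
).astype(int)
y_demo = pd.Series(
    y_values, index=rng.permutation(np.arange(1000, 1200)), name="quality"
)

alphas = np.logspace(-4, 1, 20)  # shorter grid than in the question
skf = StratifiedKFold(n_splits=5)
lasso_cv = LassoCV(alphas=alphas, random_state=17, max_iter=5000)

scores = []
for k, (train, test) in enumerate(skf.split(X_demo, y_demo)):
    # train/test hold positions, so the Series is indexed with .iloc
    lasso_cv.fit(X_demo[train], y_demo.iloc[train])
    score = lasso_cv.score(X_demo[test], y_demo.iloc[test])
    scores.append(score)
    print("[fold {0}] alpha: {1:.5f}, score: {2:.5f}".format(k, lasso_cv.alpha_, score))
```

Another option would be to call y_train.to_numpy() once before the loop, since NumPy arrays are always indexed by position.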