python pandas machine-learning scikit-learn valueerror

ValueError: Input contains NaN, infinity or a value too large for dtype('float64'), even when the isnan and isinf checks print False and dtype is float64


My code analyzes the PUBG dataset from Kaggle and builds a model. I have extracted all the features and standardized them using StandardScaler from sklearn.

//Snippet

X=standardized_data
y=training_features_output
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.30,random_state=42)
print(standardized_data.shape,training_features_output.shape)

[Output]: (4446966, 16) (4446966,)

print(np.all(np.isinf(standardized_data)))
print(np.all(np.isinf(training_features_output)))
print(np.all(np.isnan(standardized_data)))
print(np.all(np.isnan(training_features_output)))

[Output]:
False
False
False
False

print(X.dtype)
print(y.dtype)

[Output]:
dtype('float64')
dtype('float64')

model=LinearRegression()
model.fit(X_train,y_train)
y_train_pred=model.predict(X_train)
y_test_pred=model.predict(X_test)
print('Train r2_accuracy:',r2_score(y_train,y_train_pred))
print('Test r2_accuracy:',r2_score(y_test,y_test_pred))

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').


From the above outputs it looks like there are no NaN or infinite values in the dataset, and the data is float64. So why am I getting this error, and how do I resolve it?
I tried other questions about this error on Stack Overflow, but they all involved NaN values or otherwise malformed data, and I don't know where this code is going wrong.


Solution

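  • Your check is not correct. np.all() returns True only if every element is inf (or NaN), so it prints False whenever at least one value is finite, even if some values are inf or NaN.

    print(np.all(np.isinf(standardized_data)))
    ...

    Instead, use np.any(), which returns True if at least one element is inf (or NaN).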

    Proof:

    import numpy as np

    a = [np.inf, 0, 1]

    # True only if every element is inf
    np.all(np.isinf(a))
    #False

    # True if at least one element is inf
    np.any(np.isinf(a))
    #True
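
Once np.any() confirms that something is wrong, the next step is usually to find where the bad values are. Below is a minimal sketch of one way to do that with plain NumPy. The helper names report_non_finite and drop_non_finite_rows are hypothetical, and X_demo / y_demo are tiny made-up arrays standing in for standardized_data and training_features_output. Dropping the affected rows is only one possible fix and is an assumption here; whether it is appropriate depends on where the NaN or inf values come from in the original data.

```python
import numpy as np


def report_non_finite(arr):
    """Count NaN and +/-inf entries and list the rows that contain them.

    Hypothetical helper for illustration; expects a 1-D or 2-D numeric array.
    """
    arr = np.asarray(arr, dtype=np.float64)
    nan_mask = np.isnan(arr)
    inf_mask = np.isinf(arr)
    bad_mask = nan_mask | inf_mask
    # For 2-D input, flag a row if any entry in it is bad.
    row_is_bad = bad_mask.any(axis=1) if arr.ndim == 2 else bad_mask
    return {
        "has_non_finite": bool(bad_mask.any()),
        "nan_count": int(nan_mask.sum()),
        "inf_count": int(inf_mask.sum()),
        "bad_rows": np.flatnonzero(row_is_bad),
    }


def drop_non_finite_rows(X, y):
    """Keep only the rows where every feature and the target are finite."""
    X = np.asarray(X, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    keep = np.isfinite(X).all(axis=1) & np.isfinite(y)
    return X[keep], y[keep]


# Made-up data: one NaN and one inf in the features, one NaN in the target.
X_demo = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.inf], [5.0, 6.0]])
y_demo = np.array([0.1, 0.2, 0.3, np.nan])

print(report_non_finite(X_demo))
X_clean, y_clean = drop_non_finite_rows(X_demo, y_demo)
print(X_clean.shape, y_clean.shape)
```

Running the report on both the features and the target before train_test_split shows which array (and which rows) triggers the ValueError. np.isfinite is used for the filter because it is False for both NaN and inf, so a single mask covers both cases.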