
StandardScaler -ValueError: Input contains NaN, infinity or a value too large for dtype('float64')


I have the following code

X = df_X.as_matrix(header[1:col_num])
scaler = preprocessing.StandardScaler().fit(X)
X_nor = scaler.transform(X) 

and got the following error:

  File "/Users/edamame/Library/python_virenv/lib/python2.7/site-packages/sklearn/utils/validation.py", line 54, in _assert_all_finite
    " or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

I used:

print(np.isinf(X))
print(np.isnan(X))

which gives me the output below. This doesn't really tell me which element has the issue, as I have millions of rows.

[[False False False ..., False False False]
 [False False False ..., False False False]
 [False False False ..., False False False]
 ..., 
 [False False False ..., False False False]
 [False False False ..., False False False]
 [False False False ..., False False False]]

Is there a way to identify which value in the matrix X actually causes the problem? How do people avoid it in general?


Solution

  • numpy contains various logical element-wise tests for this sort of thing.

    In your particular case, you will want to use isinf and isnan.

    In response to your edit:

    You can pass the result of np.isinf() or np.isnan() to np.where(), which returns the indices where the condition is true. Here's a quick example:

    import numpy as np
    
    test = np.array([0.1, 0.3, float("Inf"), 0.2])
    
    bad_indices = np.where(np.isinf(test))
    
    print(bad_indices)
    

    You can then use those indices to replace the content of the array:

    test[bad_indices] = -1
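
    For a 2-D feature matrix like the questioner's X, combining np.isfinite (which flags both NaN and ±inf in one pass) with np.argwhere gives the exact (row, column) position of each bad entry, which scales to millions of rows. This is a sketch with made-up matrix values; replacing bad entries with the column mean is just one common strategy, not the only fix:

    ```python
    import numpy as np

    # Hypothetical matrix standing in for X; one NaN and one inf entry.
    X = np.array([[0.1, 0.2, 0.3],
                  [np.nan, 0.5, 0.6],
                  [0.7, np.inf, 0.9]])

    # ~np.isfinite(X) is True for NaN and +/-inf alike;
    # np.argwhere returns one (row, col) pair per offending entry.
    bad_positions = np.argwhere(~np.isfinite(X))
    print(bad_positions)  # [[1 0] [2 1]]

    # One common repair: replace each bad entry with the mean of the
    # finite values in its column, so StandardScaler can proceed.
    col_means = np.nanmean(np.where(np.isfinite(X), X, np.nan), axis=0)
    X[~np.isfinite(X)] = np.take(col_means, bad_positions[:, 1])
    ```

    After this, np.isfinite(X).all() is True and scaler.fit(X) no longer raises the ValueError.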