pandas, scikit-learn, sklearn-pandas

ValueError for sklearn, problem maybe caused by float32/float64 dtypes?


So I want to check the feature importance in a dataset, but I get this error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

I checked the dataset and, sure enough, there were NaN values, so I added a line to drop all rows containing NaN. Now there are no NaN values, but re-running the code gives the same error. I checked .dtypes and every column was float64, so I added .astype(np.float32) to the columns that I pass to sklearn. The error is still there. I scrolled through the entire dataframe manually and also used data.describe(); all values are between 0 and 5, so nowhere near infinity or very large values.

What is causing the error here?

Here is my code:

import pandas as pd
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
import matplotlib.pyplot as plt


data = pd.read_csv("data.csv")
data.dropna(inplace=True) #dropping all nan values


X = data.iloc[:,8:42]  
X = X.astype(np.float32) #converting data from float64 to float32
y = data.iloc[:,4]    
y = y.astype(np.float32) #converting data from float64 to float32


# feature importance
model = ExtraTreesClassifier()
model.fit(X,y)
print(model.feature_importances_)

Solution

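  • The message names three cases: NaN, infinity, or a value too large for float32. You start in the third case (a value too large for float32), which turns into the second case (infinity) after the downcast: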

    Demo:

    import numpy as np
    
    a = np.array(np.finfo(np.float64).max)
    # array(1.79769313e+308)
    
    b = a.astype('float32')
    # array(inf, dtype=float32)
    

    How to debug? Suppose the following array:

    a = np.array([np.finfo(np.float32).max, np.finfo(np.float64).max])
    # array([3.40282347e+038, 1.79769313e+308])
    
    a[a > np.finfo(np.float32).max]
    # array([1.79769313e+308])
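
The same check can be applied to a whole DataFrame before the downcast. What follows is only a rough sketch: the small frame X and its column names are made up to stand in for the question's data.iloc[:, 8:42], and the cleanup step (dropping offending rows) is just one possible choice. Note that dropna() removes NaN but, by default, leaves inf in place, so infinite values can survive that step.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's feature matrix X.
X = pd.DataFrame({
    "ok": [0.5, 1.0, 4.9],
    "has_inf": [1.0, np.inf, 2.0],
    "too_large": [3.0, 2.0, 1e300],
})

f32_max = np.finfo(np.float32).max

# True where a value is finite and still fits into float32.
fits = np.isfinite(X) & (X.abs() <= f32_max)

# Columns and rows that contain at least one offending value.
bad_columns = [col for col in X.columns if not fits[col].all()]
bad_rows = X.index[(~fits.all(axis=1)).to_numpy()].tolist()
print(bad_columns)
print(bad_rows)

# One option (an assumption, not the only fix): drop the offending rows,
# then downcast. The target y would need the same row filter.
X_clean = X[fits.all(axis=1)].astype(np.float32)
```

If bad_columns comes back empty for the real data, the features are probably not the culprit and it may be worth running the same check on y.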