
Need a Work-around for OneHotEncoder Issue in SKLearn Preprocessing


So, it seems that OneHotEncoder won't work with the np.int64 datatype (only np.int32)! Here's a sample of code:

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder

    a = np.array([[56748683, 8511896545, 51001984320],
                  [18643548615, 28614357465, 56748683],
                  [8511896545, 51001984320, 40084357915]])
    b = pd.DataFrame(a, dtype=np.int64)

    ohe = OneHotEncoder()
    c = ohe.fit_transform(b).toarray()

When I run this I get the following error: "ValueError: X needs to contain only non-negative integers."

As you can see, X DOES contain only non-negative integers! When I trim a few of the digits and change the datatype to int32 it works fine:

    a = np.array([[56748, 8511896, 51001984],
                  [18643548, 28614357, 56748],
                  [8511896, 51001984, 40084357]])
    b = pd.DataFrame(a, dtype=np.int32)
    ohe = OneHotEncoder()
    c = ohe.fit_transform(b).toarray()

Unfortunately, the data I need to encode has 11 digits (which can't be represented by int32). So, any suggestions would be helpful...

Also, I should mention, I don't necessarily need a one hot encoding, just need to create dummy variables. Thanks!


Solution

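  • pandas has a get_dummies function that creates dummy variables. By default it only encodes columns of object, string, or category dtype, so the int64 columns are cast to object first:

    import numpy as np
    import pandas as pd
    
    a = np.array([[56748683,8511896545,51001984320],[18643548615,28614357465,56748683],[8511896545,51001984320,40084357915]])
    b = pd.DataFrame(a, dtype=np.int64)
    # Cast to object so get_dummies treats the integers as categories
    b = b.astype('object')
    # One indicator column per distinct value in each original column
    c = pd.get_dummies(b)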
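
If you would rather keep integer input (for example to hand it to OneHotEncoder afterwards), another possible workaround is to replace the large IDs with small per-column codes first. The following is only a sketch under that assumption: the names codes and dummies are made up for illustration, and it relies on pd.factorize and the columns argument of pd.get_dummies rather than anything from the original question.

```python
import numpy as np
import pandas as pd

a = np.array([[56748683, 8511896545, 51001984320],
              [18643548615, 28614357465, 56748683],
              [8511896545, 51001984320, 40084357915]], dtype=np.int64)
b = pd.DataFrame(a)

# Replace each column's values with small integer codes (0, 1, 2, ...),
# assigned in order of first appearance within that column.
# The codes are tiny, so they fit comfortably in int32.
codes = b.apply(lambda col: pd.factorize(col)[0]).astype(np.int32)

# `columns=` tells get_dummies to encode these integer columns,
# which it would otherwise leave untouched.
dummies = pd.get_dummies(codes, columns=list(codes.columns))
```

Since the codes are non-negative and small, the codes frame should also be acceptable input for an encoder that only handles int32. Note that the original values are lost in the column names; keep the second return value of pd.factorize if you need to map codes back to IDs.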