python scikit-learn sklearn-pandas

Python sklearn onehotencoder


I'm trying to encode categorical data for the feature at index 4 of my vector (the 5th column), which is stored in a numpy array. The categories are either '4' or '6'. I can convert them to binary labels like this:

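 import numpy as np
 from sklearn.preprocessing import LabelEncoder

 features_in_training_set = np.array([[0, 0, 0, 0, 4], [0, 0, 0, 0, 4], [0, 0, 0, 0, 6], [0, 0, 0, 0, 4], [0, 0, 0, 0, 6]])

 features_in_training_set[:, 4] = LabelEncoder().fit_transform(features_in_training_set[:, 4])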

But, of course, I need to change this so that the classifier doesn't treat the categories as ordered (i.e., doesn't think that '6' is greater than '4'). However, when I run the following:

onehotencoder = OneHotEncoder(categorical_features=[4], handle_unknown='ignore')

features_in_training_set = onehotencoder.fit_transform(features_in_training_set).toarray()

The error I'm receiving is:

TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

TypeError: Wrong type for parameter `n_values`. Expected 'auto', int or array of ints, got <class 'numpy.ndarray'>

I've checked if I have any missing values or any strings and I don't. All features are integers.

Thanks.


Solution

  • The current OneHotEncoder in scikit-learn (0.20 and later) can handle strings and other categorical values directly, so you no longer need to run LabelEncoder first to convert categories to numbers (or, as you did, to map arbitrary numbers to sorted consecutive integers).

    This error is a bug in OneHotEncoder, which has been evolving to handle the case above while still supporting older use cases like the one in your question. Adding n_values='auto' to the code removes the error:

    onehotencoder = OneHotEncoder(categorical_features=[4], n_values='auto', 
                                  handle_unknown='ignore')
    

    Removing the handle_unknown parameter from your code also makes the error go away, but you should not do that.

    See this issue here:
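
As a separate note for readers on newer releases: the categorical_features and n_values parameters were deprecated in scikit-learn 0.20 and later removed, so the workaround above only applies to the older API. A minimal sketch of the usual replacement, assuming scikit-learn 0.20 or later, is to wrap OneHotEncoder in a ColumnTransformer that targets column 4 and passes the other columns through. The variable names column_transformer and encoded are illustrative, not taken from the question.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Same toy data as in the question; column index 4 holds the category (4 or 6).
features_in_training_set = np.array([
    [0, 0, 0, 0, 4],
    [0, 0, 0, 0, 4],
    [0, 0, 0, 0, 6],
    [0, 0, 0, 0, 4],
    [0, 0, 0, 0, 6],
])

# One-hot encode only column 4 and pass the other columns through unchanged.
# No LabelEncoder step is needed: OneHotEncoder works on the raw values.
column_transformer = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), [4])],
    remainder="passthrough",
)

encoded = column_transformer.fit_transform(features_in_training_set)

# The result can be sparse or dense depending on its density,
# so convert defensively before inspecting it.
if hasattr(encoded, "toarray"):
    encoded = encoded.toarray()

print(encoded.shape)
```

Note that ColumnTransformer places the encoded columns first and the passed-through columns after them, so the column order differs from the input array.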