python machine-learning scikit-learn preprocessor one-hot-encoding

OneHotEncoder categories argument


As of sklearn 0.22, the categorical_features argument of OneHotEncoder has been removed, so the following code no longer runs:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([[1, 1], [2, 2], [1, 3]])
encoder = OneHotEncoder(categorical_features=[1], sparse=False)

print(encoder.fit_transform(X))

The question is: how do I achieve the same behavior as the code above using the categories argument? OneHotEncoder(categories=[[1, 2], [1, 2, 3]], sparse=False) would also encode the first column, and OneHotEncoder(categories=[[1, 2, 3]], sparse=False) throws an error.


Solution

  • You want to one-hot encode the second column [1, 2, 3] and pass the first column [1, 2, 1] through unchanged. In newer sklearn versions (0.20 and later), you can use ColumnTransformer to apply different preprocessing steps to different columns:

    import numpy as np
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder
    
    X = np.array([[1, 1], [2, 2], [1, 3]])
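    encoder = ColumnTransformer(
        # One-hot encode only column index 1; categories are inferred from the data
        [('number1', OneHotEncoder(dtype='int'), [1])],
        # Append the remaining columns (here column 0) unchanged after the encoded ones
        remainder="passthrough"
    )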
    
    print(encoder.fit_transform(X))
    

    Then you don't have to specify the value range with categories, because the encoder infers the categories from the selected column during fit. Refer to the ColumnTransformer documentation for further details.
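If you still want to fix the categories explicitly, as the question asks, a minimal sketch could look like the one below. The categories argument expects one list per column that the encoder itself receives, which is why categories=[[1, 2, 3]] fails on the full two-column array but is accepted once ColumnTransformer hands the encoder only column 1. The step name "onehot" and the sparse_threshold=0 setting (used here to force a dense result) are choices made for this illustration, not part of the original answer.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

X = np.array([[1, 1], [2, 2], [1, 3]])

encoder = ColumnTransformer(
    # The encoder only sees column 1, so categories holds exactly one list.
    [("onehot", OneHotEncoder(categories=[[1, 2, 3]], dtype=int), [1])],
    # Column 0 is appended unchanged after the encoded columns.
    remainder="passthrough",
    # Assumption for this sketch: always return a dense NumPy array.
    sparse_threshold=0,
)

result = encoder.fit_transform(X)
print(result)
# [[1 0 0 1]
#  [0 1 0 2]
#  [0 0 1 1]]
```

The first three output columns are the one-hot encoding of the second input column, and the last output column is the untouched first input column.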