OneHotEncoder categories with strings


I have a dataset of the following shape:

type_of_bicycle_lane   month   other variables
'on the street'        1       ...
'fully seperated'      4       ...
'fully seperated'      8       ...
'fully seperated'      1       ...
'other_1'              12      ...
'other_2'              8       ...

I am trying to pass this through a scikit-learn pipeline, specifically through a OneHotEncoder, so that I can use it with various ML models. However, for type_of_bicycle_lane I only want to create one-hot columns for the categories ['on the street', 'fully seperated'].

Thus, I wrote the following code:

full_pipeline = ColumnTransformer([
    ("bicycle_lane", OneHotEncoder(categories = ['on the street', 'fully seperated']), ["type_of_bicycle_lane"]),
    ])

However, I keep receiving the error: Shape mismatch: if categories is an array, it has to be of shape (n_features,).

I consulted the documentation, and as I understand it the categories parameter should be a list of array-likes, where "categories[i] holds the categories expected in the ith column. The passed categories should not mix strings and numeric values within a single feature, and should be sorted in case of numeric values." (scikit-learn OneHotEncoder documentation)

I have used this OneHotEncoder before, for example with

("month", OneHotEncoder(categories = [range(1,13)]), ["month"]) 

which worked perfectly fine.

What might I be doing wrong? Thank you very much in advance!


Solution

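  • categories should be a list of array-likes, one per encoded column, so the list of strings has to be wrapped in another list. Additionally, you should set handle_unknown='ignore' or 'infrequent_if_exist' so that values outside the listed categories (such as 'other_1') do not raise an error.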

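    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder

    # categories is a list with one inner list per encoded column
    full_pipeline = ColumnTransformer([
        ("bicycle_lane", OneHotEncoder(categories = [['on the street', 'fully seperated']], handle_unknown='ignore'), ["type_of_bicycle_lane"]),
        ])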
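
To check the behaviour end to end, here is a minimal sketch on a small DataFrame that mirrors the sample rows from the question. The DataFrame contents and the variable names df and encoded are assumptions made for illustration, and sparse_threshold=0 is set only so that the result comes back as a dense array that is easy to inspect.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample data mirroring the rows shown in the question.
df = pd.DataFrame({
    "type_of_bicycle_lane": [
        "on the street", "fully seperated", "fully seperated",
        "fully seperated", "other_1", "other_2",
    ],
    "month": [1, 4, 8, 1, 12, 8],
})

full_pipeline = ColumnTransformer(
    [
        (
            "bicycle_lane",
            # One inner list per encoded column; values not listed here
            # are encoded as all zeros because of handle_unknown="ignore".
            OneHotEncoder(
                categories=[["on the street", "fully seperated"]],
                handle_unknown="ignore",
            ),
            ["type_of_bicycle_lane"],
        ),
    ],
    sparse_threshold=0,  # always return a dense array (illustration only)
)

encoded = full_pipeline.fit_transform(df)
print(encoded)
```

The output columns follow the order given in categories, so the first column marks 'on the street' and the second 'fully seperated'. The 'other_1' and 'other_2' rows come out as all zeros rather than raising an error.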