string scikit-learn categories one-hot-encoding

OneHotEncoder categories with strings

I have a dataset of the following shape:

type_of_bicycle_lane	month	other variables
'on the street'	1	...
'fully seperated'	4	...
'fully seperated'	8	...
'fully seperated'	1	...
'other_1'	12	...
'other_2'	8	...

I am trying to pass this through a scikit-learn pipeline, specifically through a OneHotEncoder, so that I can use it with various ML models. However, for type_of_bicycle_lane I only want to create one-hot columns for the categories ['on the street', 'fully seperated'].

Thus, I wrote the following code:

full_pipeline = ColumnTransformer([
    ("bicycle_lane", OneHotEncoder(categories = ['on the street', 'fully seperated']), ["type_of_bicycle_lane"]),
    ])

However, I keep getting the error: Shape mismatch: if categories is an array, it has to be of shape (n_features,).

I consulted the documentation, and as I understand it, the categories parameter should be a list of array-likes where "categories[i] holds the categories expected in the ith column. The passed categories should not mix strings and numeric values within a single feature, and should be sorted in case of numeric values." Link to documentation

I have used OneHotEncoder with this parameter before, for example

("month", OneHotEncoder(categories = [range(1,13)]), ["month"])

which worked perfectly fine.

What might I be doing wrong? Thank you very much in advance!

Solution

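categories should be a list of array-likes, with one inner array-like per encoded column. You passed a flat list of two strings, so the encoder reads it as categories for two columns while only one column is given, hence the shape mismatch. Your month example worked because [range(1,13)] is already a list containing a single array-like. Additionally, you should set handle_unknown='ignore' or 'infrequent_if_exist' to avoid an error, because the column contains values such as 'other_1' that are not in categories and the default handle_unknown='error' rejects them.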

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

full_pipeline = ColumnTransformer([
    ("bicycle_lane", OneHotEncoder(categories = [['on the street', 'fully seperated']], handle_unknown='ignore'), ["type_of_bicycle_lane"]),
    ])
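
As a quick check, here is a minimal sketch that runs the corrected transformer on a small DataFrame rebuilt from the sample rows in the question. The DataFrame itself and the sparse_threshold=0 setting (used only to get a dense array that is easy to print) are my assumptions, not part of your original setup.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy data rebuilt from the sample rows in the question (assumption).
df = pd.DataFrame({
    "type_of_bicycle_lane": [
        "on the street", "fully seperated", "fully seperated",
        "fully seperated", "other_1", "other_2",
    ],
    "month": [1, 4, 8, 1, 12, 8],
})

full_pipeline = ColumnTransformer(
    [
        (
            "bicycle_lane",
            # Outer list: one entry per encoded column.
            # Inner list: the categories to keep for that column.
            OneHotEncoder(
                categories=[["on the street", "fully seperated"]],
                handle_unknown="ignore",
            ),
            ["type_of_bicycle_lane"],
        ),
    ],
    sparse_threshold=0,  # always return a dense array (for readability only)
)

encoded = full_pipeline.fit_transform(df)
print(encoded)
```

Rows with 'other_1' or 'other_2' should come out as all zeros, since handle_unknown='ignore' encodes unknown categories as a zero vector. The column order follows the order of the inner list.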